Project Report - Airbnb: Detailed Listings for New York City, New York, United States and Rome, Lazio, Italy (2025)¶

Dataset "Airbnb: Detailed Listings for New York City, New York, United States and Rome, Lazio, Italy (2025)"

_About – Inside Airbnb_

1. Theory 2. Data Understanding 3. Data Preparation 4. Visualization 5. Dash 6. NLP/spaCy

Course: Data Analytics Instructor: Axel Wemmel Participant: Natalia Mokeeva Submission date: 18.09.2025


Contents - Outline¶

  • Introduction
  • Theory
    • Exploratory Data Analysis
    • Data Mining
    • Data
  • Preliminaries
    • Libraries Used
    • Logic
  • Data Understanding and Data Preparation - New York
    • First Look at the Data
    • Second Look - Example
    • Descriptive Statistics
    • Missing Data and Null Values
    • Columns price, bathrooms, has_availability
    • Examining the Columns estimated_revenue_l365d, host_acceptance_rate, and beds
    • Columns review_*
    • Feature Engineering: Creating the Column host_since_year
  • Data Understanding and Data Preparation - Rome
    • First Look at the Data
    • Second Look - Example
    • Descriptive Statistics
    • Missing Data and Null Values
    • Columns price, bathrooms, has_availability
    • Examining the Columns estimated_revenue_l365d, host_acceptance_rate, and beds
    • Columns review_*
    • Feature Engineering: Creating the Column host_since_year
  • EDA Model - New York
    • Univariate (Distributions)
    • Bivariate: Correlation and Group Comparisons
    • Multivariate: Pairplot and Price Model (Descriptive OLS Model)
    • Temporal Characteristics
  • Visualization of the Listings (Plotly Express)
    • Geo-Visualization of the New York Listings
    • Dash with Chart and Slider: New York
    • Dashboard: Showing Components Side by Side: New York
    • Geo-Visualization of the Rome Listings
    • Dash with Chart and Slider: Rome
    • Dashboard: Showing Components Side by Side: Rome
    • Donut Charts: License Status New York vs. Rome
  • Natural Language Processing (spaCy)
    • Amenities: Airbnb New York vs. Airbnb Rome

Introduction¶

The dataset to be analyzed is "Airbnb: Detailed Listings for New York City, New York, United States and Rome, Lazio, Italy (2025)".

Data details:

  • The data uses publicly available information compiled from the Airbnb website, including the availability calendar for 365 days into the future and the reviews for each listing.
  • No "private" information is used. Names, photos, listings, and review details are all displayed publicly on the Airbnb website.
  • Location information for listings is anonymized by Airbnb. In practice, this means that the location of a listing shown on the map or in the data may be shifted by up to 150 meters from its actual address.
  • The neighborhood name for each listing is determined by matching the listing's geographic coordinates against the city's own definition of its neighborhoods. The neighborhood names assigned by Airbnb are not used because of their inaccuracies.

The analysis is based on the files "listings.csv" for New York City and "listings.csv" for Rome, Lazio, which contain information from the Airbnb website from 2025. After a closer look at the data and the meaning behind it, the descriptive statistics and the individual phases of the EDA model are explained in more detail.

Subsequently, the phases "Data Understanding" and "Data Preparation" are carried out and documented in practice. Cleaning the data (including the columns "property_type", "neighbourhood_group_cleansed", and "price") makes a more specific analysis possible: in particular, the chart with slider "Airbnb NYC - year filter: availability vs. price" in an interactive dashboard.

A few more words about the dataset:

  • Description by the creator:

Inside Airbnb is a public-interest project. It collects Airbnb data to analyze Airbnb's impact on residential neighborhoods. _About – Inside Airbnb_

Info:

  • The file types, formats, etc. given here will be adjusted later on for easier processing.
  • The dataset can be downloaded here:
    • Get the Data*

Data dictionary:

  • Inside Airbnb Data Dictionary (Google Sheet)
The following overview summarizes the relevant sections of the data dictionary (section - description - columns/fields - units/notes):

  • Time level: snapshot of the inventory at scrape time. Columns: last_scraped. Units/notes: datetime (UTC) - the time at which the listing was "scraped".
  • Units of analysis (granularity): levels at which data is evaluated/aggregated. Columns: listing: id • host: host_id • neighborhood: neighbourhood_cleansed, neighbourhood_group_cleansed • time: host_since, first_review, last_review. Units/notes: id (integer) = unique listing ID • host_id (integer) = host ID • host_since (date) = account creation date • first_review/last_review (date) = first/last review.
  • Key measures (market/quality): central metrics per listing. Columns: price: price • availability: availability_30, availability_60, availability_90, availability_365, has_availability • reviews: number_of_reviews, reviews_per_month, review_scores_* • capacity/amenities: accommodates, bedrooms, beds, bathrooms, bathrooms_text • booking: instant_bookable. Units/notes: price (currency) = daily price in local currency; the "$" in the export is a technical artifact (ignore). • availability_x (integer) = available nights in the next x days; "not available" can mean booked or blocked. • has_availability (boolean) = t/f. • number_of_reviews (integer). • reviews_per_month = average reviews per month over the listing's lifetime (cf. the pseudocode in the dictionary). • review_scores_* (numeric, Airbnb scores). • accommodates (integer) = guests. • bathrooms (numeric), bathrooms_text (string, textual description, e.g. "1.5 shared baths"). • instant_bookable (boolean) = t/f, whether automatic booking is possible.
  • Listing level - core metrics: attributes per listing (display). Columns: property_type, room_type, accommodates, bedrooms, beds, bathrooms/bathrooms_text, price, has_availability, availability_30/60/90/365, number_of_reviews, reviews_per_month, review_scores_*, instant_bookable, host_is_superhost, host_listings_count, host_since; location: neighbourhood_cleansed, neighbourhood_group_cleansed, latitude, longitude. Units/notes: types per dictionary: property_type, room_type, neighbourhood_* (text); latitude/longitude (numeric); host_is_superhost (boolean, t/f); host_listings_count (integer).
  • Derived metrics (useful, not original): computed from raw fields (for analyses). Columns: occupied_nights_365, revpar_proxy, price_per_guest, host_tenure_days. Units/notes: occupied nights (proxy): occupied_nights_365 = 365 − availability_365 (nights/year) • RevPAR (proxy): estimated_revenue_l365d / 365 • price per guest: price / accommodates • host tenure (days): last_scraped − host_since.
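The derived metrics in the last row are not part of the raw export; a minimal sketch of how they could be computed with pandas (the toy frame and its values are invented for illustration):

```python
import pandas as pd

# Invented toy frame with the raw fields named as in the data dictionary
data = pd.DataFrame({
    "price": [100.0, 80.0],
    "accommodates": [2, 4],
    "availability_365": [165, 365],
    "estimated_revenue_l365d": [7300.0, 0.0],
    "host_since": pd.to_datetime(["2015-01-01", "2020-06-01"]),
    "last_scraped": pd.to_datetime(["2025-06-17", "2025-06-17"]),
})

# Occupied nights (proxy): 365 - availability_365
data["occupied_nights_365"] = 365 - data["availability_365"]
# RevPAR (proxy): revenue over the last 365 days divided by 365
data["revpar_proxy"] = data["estimated_revenue_l365d"] / 365
# Price per guest
data["price_per_guest"] = data["price"] / data["accommodates"]
# Host tenure in days
data["host_tenure_days"] = (data["last_scraped"] - data["host_since"]).dt.days
```

Each formula maps one-to-one onto the definitions in the overview above.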

Theory¶

Exploratory Data Analysis¶

  • Exploratory data analysis, also called exploratory statistics, is a subfield of statistics. It analyzes data for which few or even no relationships are known in advance.
  • The goal of exploratory data analysis is to gain an overview of the data at hand in order to assess and evaluate it better. Only during the analysis do regularities, dependencies, or notable relationships between seemingly completely unrelated data become apparent.
  • Once enough patterns and relationships have been discovered to make a statement, the data is then searched specifically for outliers. The individual outliers are analyzed in turn to decide whether they can be removed from the data pool or whether the analysis needs to be refined even further.
  • Missing values are analyzed as well. They point to possible problems during data collection.
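Such a missing-value check can be sketched with pandas; `df` here is a small invented example frame, not the Airbnb data:

```python
import numpy as np
import pandas as pd

# Small invented example frame with gaps
df = pd.DataFrame({
    "price": [120.0, np.nan, 95.0, np.nan],
    "beds": [1.0, 2.0, np.nan, 1.0],
    "room_type": ["Entire home/apt", "Private room", "Private room", None],
})

# Absolute and relative number of missing values per column,
# sorted so the most affected columns appear first
overview = pd.DataFrame({
    "n_missing": df.isna().sum(),
    "pct_missing": df.isna().mean().round(2),
}).sort_values("n_missing", ascending=False)
print(overview)
```

Columns with a high missing rate are then candidates for imputation, conversion, or removal in the preparation phase.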

Data Mining¶

What is data mining?

  • Data mining is the process of extracting useful information from a collection of data, often from a data warehouse or a set of linked datasets.
  • Data mining tools include powerful statistical, mathematical, and analytical functions.
  • Their primary task is to analyze large amounts of data in order to identify trends, patterns, and relationships that enable well-founded decision-making and planning.

General procedure:

  1. Understand the problem
  2. Collect the data
  3. Prepare and understand the data
  4. Train the users

Data¶

Data Cleaning¶

  • Data cleaning, also called data cleansing or data scrubbing, is the process of correcting incorrect, incomplete, duplicated, or otherwise faulty data in a dataset.
  • It is about identifying data errors and then changing, updating, or removing data to correct them. Data cleaning improves data quality and helps provide more accurate, consistent, and reliable information for decision-making in an organization.

Steps:

  1. Inspection and profiling
  2. Cleaning
  3. Verification
  4. Reporting

The cleaned data can then be passed on to the remaining phases of data preparation, starting with data structuring and data transformation, to prepare it for analysis.
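As a concrete cleaning example: the data dictionary notes that the "$" in the exported price column is a technical artifact. A minimal sketch of how such a column could be cleaned (the raw values are invented):

```python
import pandas as pd

# Invented raw export: price as a string with "$" and a thousands separator
raw = pd.DataFrame({"price": ["$1,250.00", "$85.00", None]})

# Cleaning step: strip "$" and "," and convert to a numeric dtype;
# errors="coerce" turns unparseable entries into NaN instead of raising
raw["price"] = pd.to_numeric(
    raw["price"].str.replace(r"[$,]", "", regex=True), errors="coerce"
)
```

After this step the column can be used in descriptive statistics and models without string artifacts.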

Preliminaries¶

Before the dataset is analyzed and prepared, this chapter provides some preliminary information that makes an overall understanding easier for the reader.

Libraries Used¶

In addition to the standard libraries:

  • pandas
  • numpy
  • matplotlib
  • seaborn
  • spacy

the following libraries are also used:

  • re (regular expression operations: extracting string parts from columns)
  • dash and plotly express (visualizing data in dashboards)
  • IPython.display

The packages were installed via pip, e.g.:

  • pip install spacy
  • pip install statsmodels
  • python -m spacy download en_core_web_sm (the spaCy default English model)

Logic (Files, Backups)¶

The given dataset "Airbnb: Detailed Listings for New York City, New York, United States and Rome, Lazio, Italy (2025)" is not clean in several respects: there are null values and spelling errors, some columns have no values at all, and some data formats need to be converted.

To avoid losing relevant data during processing - and to keep the Jupyter notebook performant - backups were saved after each relevant processing step. The following backup file exists (in CSV format):

  • data_clean
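The checkpoint logic can be sketched as follows; the frame content is invented, only the save/restore pattern matters:

```python
import pandas as pd

# Invented intermediate result after a cleaning step
data_clean = pd.DataFrame({"id": [2539, 2595], "price": [110.0, 240.0]})

# Save a checkpoint after a relevant processing step ...
data_clean.to_csv("data_clean.csv", index=False)

# ... and restore it later without re-running the earlier cells
restored = pd.read_csv("data_clean.csv")
```

Writing `index=False` keeps the CSV free of a redundant index column, so the round trip reproduces the original frame.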

Data Understanding and Data Preparation - New York¶

In the following, the CRISP-DM phase of data understanding is carried out in practice: we gain insight into the structure and layout of the dataset.

In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import statsmodels.formula.api as smf
import seaborn as sns
import spacy

import os, re, webbrowser

from dash import Dash, html, dash_table, dcc, callback, Output, Input
from IPython.display import HTML, IFrame, display
from plotly.subplots import make_subplots
from spacy import displacy

First Look at the Data¶

First, an initial impression of the data is gained:

  • What does the dataset look like as a whole, which values appear at the top and which at the bottom, and which data types predominate?
  • The individual columns are also examined a bit more closely, split into numeric and non-numeric columns, among other things.
  • The numeric columns are then put under the microscope: how many are there, and what do the arithmetic mean, the standard deviation, the quantiles, and the maximum values look like?
In [2]:
# New York: read in the dataset + first look at the data
file_NY = "data/NY/listings.csv"
data_NY = pd.read_csv(file_NY, low_memory=False)

# Look at the head of the data
data_NY.head(2)
Out[2]:
id listing_url scrape_id last_scraped source name description neighborhood_overview picture_url host_id ... review_scores_communication review_scores_location review_scores_value license instant_bookable calculated_host_listings_count calculated_host_listings_count_entire_homes calculated_host_listings_count_private_rooms calculated_host_listings_count_shared_rooms reviews_per_month
0 2539 https://www.airbnb.com/rooms/2539 20250617032754 2025-06-17 city scrape Superfast Wi-Fi. Clean & quiet home by the park Bright, serene room in a renovated apartment h... Close to Prospect Park and Historic Ditmas Park https://a0.muscache.com/pictures/hosting/Hosti... 2787 ... 5.0 4.75 4.88 NaN f 6 1 5 0 0.08
1 2595 https://www.airbnb.com/rooms/2595 20250617032754 2025-06-17 city scrape Skylit Midtown Manhattan Spacious Studio Beautiful, spacious skylit studio in the heart... Centrally located in the heart of Manhattan ju... https://a0.muscache.com/pictures/hosting/Hosti... 2845 ... 4.8 4.81 4.40 NaN f 3 3 0 0 0.26

2 rows × 79 columns

In [3]:
# Check DataFrame
type(data_NY)
Out[3]:
pandas.core.frame.DataFrame
In [4]:
# Look at the tail of the data
data_NY.tail(2)
Out[4]:
id listing_url scrape_id last_scraped source name description neighborhood_overview picture_url host_id ... review_scores_communication review_scores_location review_scores_value license instant_bookable calculated_host_listings_count calculated_host_listings_count_entire_homes calculated_host_listings_count_private_rooms calculated_host_listings_count_shared_rooms reviews_per_month
36320 1444497972042807274 https://www.airbnb.com/rooms/1444497972042807274 20250617032754 2025-06-17 city scrape Steps from the Empire State, Iconic Views, 2 Beds Welcome to your dream Manhattan escape — a stu... NaN https://a0.muscache.com/pictures/hosting/Hosti... 677540330 ... NaN NaN NaN NaN f 2 2 0 0 NaN
36321 1444533606768940303 https://www.airbnb.com/rooms/1444533606768940303 20250617032754 2025-06-17 city scrape 2BR/Brooklyn: Private Backyard Spacious 2BR Brooklyn home in beautiful Bed-St... NaN https://a0.muscache.com/pictures/miso/Hosting-... 10931359 ... NaN NaN NaN NaN f 1 1 0 0 NaN

2 rows × 79 columns

In [5]:
# Overall shape of the dataset
data_NY.shape
Out[5]:
(36322, 79)
In [6]:
# New York: more detail on columns and data types
data_NY.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36322 entries, 0 to 36321
Data columns (total 79 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   id                                            36322 non-null  int64  
 1   listing_url                                   36322 non-null  object 
 2   scrape_id                                     36322 non-null  int64  
 3   last_scraped                                  36322 non-null  object 
 4   source                                        36322 non-null  object 
 5   name                                          36320 non-null  object 
 6   description                                   35374 non-null  object 
 7   neighborhood_overview                         19084 non-null  object 
 8   picture_url                                   36322 non-null  object 
 9   host_id                                       36322 non-null  int64  
 10  host_url                                      36322 non-null  object 
 11  host_name                                     36307 non-null  object 
 12  host_since                                    36307 non-null  object 
 13  host_location                                 28913 non-null  object 
 14  host_about                                    20956 non-null  object 
 15  host_response_time                            21550 non-null  object 
 16  host_response_rate                            21550 non-null  object 
 17  host_acceptance_rate                          21720 non-null  object 
 18  host_is_superhost                             35830 non-null  object 
 19  host_thumbnail_url                            36307 non-null  object 
 20  host_picture_url                              36307 non-null  object 
 21  host_neighbourhood                            29034 non-null  object 
 22  host_listings_count                           36307 non-null  float64
 23  host_total_listings_count                     36307 non-null  float64
 24  host_verifications                            36307 non-null  object 
 25  host_has_profile_pic                          36307 non-null  object 
 26  host_identity_verified                        36307 non-null  object 
 27  neighbourhood                                 19085 non-null  object 
 28  neighbourhood_cleansed                        36322 non-null  object 
 29  neighbourhood_group_cleansed                  36322 non-null  object 
 30  latitude                                      36322 non-null  float64
 31  longitude                                     36322 non-null  float64
 32  property_type                                 36322 non-null  object 
 33  room_type                                     36322 non-null  object 
 34  accommodates                                  36322 non-null  int64  
 35  bathrooms                                     21682 non-null  float64
 36  bathrooms_text                                36286 non-null  object 
 37  bedrooms                                      30274 non-null  float64
 38  beds                                          21620 non-null  float64
 39  amenities                                     36322 non-null  object 
 40  price                                         21459 non-null  object 
 41  minimum_nights                                36322 non-null  int64  
 42  maximum_nights                                36322 non-null  int64  
 43  minimum_minimum_nights                        36322 non-null  int64  
 44  maximum_minimum_nights                        36322 non-null  int64  
 45  minimum_maximum_nights                        36322 non-null  int64  
 46  maximum_maximum_nights                        36322 non-null  int64  
 47  minimum_nights_avg_ntm                        36322 non-null  float64
 48  maximum_nights_avg_ntm                        36322 non-null  float64
 49  calendar_updated                              0 non-null      float64
 50  has_availability                              30705 non-null  object 
 51  availability_30                               36322 non-null  int64  
 52  availability_60                               36322 non-null  int64  
 53  availability_90                               36322 non-null  int64  
 54  availability_365                              36322 non-null  int64  
 55  calendar_last_scraped                         36322 non-null  object 
 56  number_of_reviews                             36322 non-null  int64  
 57  number_of_reviews_ltm                         36322 non-null  int64  
 58  number_of_reviews_l30d                        36322 non-null  int64  
 59  availability_eoy                              36322 non-null  int64  
 60  number_of_reviews_ly                          36322 non-null  int64  
 61  estimated_occupancy_l365d                     36322 non-null  int64  
 62  estimated_revenue_l365d                       21459 non-null  float64
 63  first_review                                  25171 non-null  object 
 64  last_review                                   25171 non-null  object 
 65  review_scores_rating                          25171 non-null  float64
 66  review_scores_accuracy                        25161 non-null  float64
 67  review_scores_cleanliness                     25170 non-null  float64
 68  review_scores_checkin                         25157 non-null  float64
 69  review_scores_communication                   25165 non-null  float64
 70  review_scores_location                        25154 non-null  float64
 71  review_scores_value                           25155 non-null  float64
 72  license                                       5329 non-null   object 
 73  instant_bookable                              36322 non-null  object 
 74  calculated_host_listings_count                36322 non-null  int64  
 75  calculated_host_listings_count_entire_homes   36322 non-null  int64  
 76  calculated_host_listings_count_private_rooms  36322 non-null  int64  
 77  calculated_host_listings_count_shared_rooms   36322 non-null  int64  
 78  reviews_per_month                             25171 non-null  float64
dtypes: float64(19), int64(24), object(36)
memory usage: 21.9+ MB
In [7]:
# Data types
data_NY.dtypes
Out[7]:
id                                                int64
listing_url                                      object
scrape_id                                         int64
last_scraped                                     object
source                                           object
                                                 ...   
calculated_host_listings_count                    int64
calculated_host_listings_count_entire_homes       int64
calculated_host_listings_count_private_rooms      int64
calculated_host_listings_count_shared_rooms       int64
reviews_per_month                               float64
Length: 79, dtype: object
In [8]:
# Show all columns
data_NY.columns
Out[8]:
Index(['id', 'listing_url', 'scrape_id', 'last_scraped', 'source', 'name',
       'description', 'neighborhood_overview', 'picture_url', 'host_id',
       'host_url', 'host_name', 'host_since', 'host_location', 'host_about',
       'host_response_time', 'host_response_rate', 'host_acceptance_rate',
       'host_is_superhost', 'host_thumbnail_url', 'host_picture_url',
       'host_neighbourhood', 'host_listings_count',
       'host_total_listings_count', 'host_verifications',
       'host_has_profile_pic', 'host_identity_verified', 'neighbourhood',
       'neighbourhood_cleansed', 'neighbourhood_group_cleansed', 'latitude',
       'longitude', 'property_type', 'room_type', 'accommodates', 'bathrooms',
       'bathrooms_text', 'bedrooms', 'beds', 'amenities', 'price',
       'minimum_nights', 'maximum_nights', 'minimum_minimum_nights',
       'maximum_minimum_nights', 'minimum_maximum_nights',
       'maximum_maximum_nights', 'minimum_nights_avg_ntm',
       'maximum_nights_avg_ntm', 'calendar_updated', 'has_availability',
       'availability_30', 'availability_60', 'availability_90',
       'availability_365', 'calendar_last_scraped', 'number_of_reviews',
       'number_of_reviews_ltm', 'number_of_reviews_l30d', 'availability_eoy',
       'number_of_reviews_ly', 'estimated_occupancy_l365d',
       'estimated_revenue_l365d', 'first_review', 'last_review',
       'review_scores_rating', 'review_scores_accuracy',
       'review_scores_cleanliness', 'review_scores_checkin',
       'review_scores_communication', 'review_scores_location',
       'review_scores_value', 'license', 'instant_bookable',
       'calculated_host_listings_count',
       'calculated_host_listings_count_entire_homes',
       'calculated_host_listings_count_private_rooms',
       'calculated_host_listings_count_shared_rooms', 'reviews_per_month'],
      dtype='object')
In [9]:
# New York: split into numeric and non-numeric columns

# Numeric columns
data_numeric = data_NY.select_dtypes(include=[np.number])
numeric_cols = data_numeric.columns.values
print("\n", "1 - Numeric columns: ", "\n")
print(numeric_cols)

# Non-numeric columns
data_non_numeric = data_NY.select_dtypes(exclude=[np.number])
non_numeric_cols = data_non_numeric.columns.values
print("\n", "2 - Non-numeric columns: ", "\n")
print(non_numeric_cols)
 1 - Numeric columns:  

['id' 'scrape_id' 'host_id' 'host_listings_count'
 'host_total_listings_count' 'latitude' 'longitude' 'accommodates'
 'bathrooms' 'bedrooms' 'beds' 'minimum_nights' 'maximum_nights'
 'minimum_minimum_nights' 'maximum_minimum_nights'
 'minimum_maximum_nights' 'maximum_maximum_nights'
 'minimum_nights_avg_ntm' 'maximum_nights_avg_ntm' 'calendar_updated'
 'availability_30' 'availability_60' 'availability_90' 'availability_365'
 'number_of_reviews' 'number_of_reviews_ltm' 'number_of_reviews_l30d'
 'availability_eoy' 'number_of_reviews_ly' 'estimated_occupancy_l365d'
 'estimated_revenue_l365d' 'review_scores_rating' 'review_scores_accuracy'
 'review_scores_cleanliness' 'review_scores_checkin'
 'review_scores_communication' 'review_scores_location'
 'review_scores_value' 'calculated_host_listings_count'
 'calculated_host_listings_count_entire_homes'
 'calculated_host_listings_count_private_rooms'
 'calculated_host_listings_count_shared_rooms' 'reviews_per_month']

 2 - Non-numeric columns:  

['listing_url' 'last_scraped' 'source' 'name' 'description'
 'neighborhood_overview' 'picture_url' 'host_url' 'host_name' 'host_since'
 'host_location' 'host_about' 'host_response_time' 'host_response_rate'
 'host_acceptance_rate' 'host_is_superhost' 'host_thumbnail_url'
 'host_picture_url' 'host_neighbourhood' 'host_verifications'
 'host_has_profile_pic' 'host_identity_verified' 'neighbourhood'
 'neighbourhood_cleansed' 'neighbourhood_group_cleansed' 'property_type'
 'room_type' 'bathrooms_text' 'amenities' 'price' 'has_availability'
 'calendar_last_scraped' 'first_review' 'last_review' 'license'
 'instant_bookable']
In [10]:
# Number of numeric columns
data_NY.select_dtypes(include="number").shape[1]
Out[10]:
43
In [11]:
# Number of categorical columns
data_NY.select_dtypes(include=["object"]).shape[1]
Out[11]:
36
In [12]:
# New York: a closer look at the numeric values
data_NY.describe().T
Out[12]:
count mean std min 25% 50% 75% max
id 36322.0 4.382216e+17 5.199100e+17 2.539000e+03 2.121091e+07 5.004978e+07 9.277467e+17 1.444534e+18
scrape_id 36322.0 2.025062e+13 0.000000e+00 2.025062e+13 2.025062e+13 2.025062e+13 2.025062e+13 2.025062e+13
host_id 36322.0 1.713590e+08 1.893709e+08 1.678000e+03 1.764159e+07 8.336440e+07 3.036223e+08 7.005911e+08
host_listings_count 36307.0 2.441627e+02 9.666283e+02 1.000000e+00 1.000000e+00 2.000000e+00 1.000000e+01 4.925000e+03
host_total_listings_count 36307.0 3.406701e+02 1.240830e+03 1.000000e+00 1.000000e+00 3.000000e+00 1.400000e+01 9.109000e+03
latitude 36322.0 4.072856e+01 5.648223e-02 4.050037e+01 4.068828e+01 4.072594e+01 4.076235e+01 4.091139e+01
longitude 36322.0 -7.394697e+01 5.525153e-02 -7.425191e+01 -7.398326e+01 -7.395471e+01 -7.392785e+01 -7.371182e+01
accommodates 36322.0 2.751886e+00 1.891702e+00 1.000000e+00 2.000000e+00 2.000000e+00 4.000000e+00 1.600000e+01
bathrooms 21682.0 1.188382e+00 5.544393e-01 0.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.550000e+01
bedrooms 30274.0 1.383332e+00 9.381921e-01 0.000000e+00 1.000000e+00 1.000000e+00 2.000000e+00 1.500000e+01
beds 21620.0 1.636494e+00 1.193949e+00 0.000000e+00 1.000000e+00 1.000000e+00 2.000000e+00 4.000000e+01
minimum_nights 36322.0 2.920572e+01 3.511100e+01 1.000000e+00 3.000000e+01 3.000000e+01 3.000000e+01 1.124000e+03
maximum_nights 36322.0 6.024392e+04 1.126843e+07 1.000000e+00 1.200000e+02 3.650000e+02 1.125000e+03 2.147484e+09
minimum_minimum_nights 36322.0 2.904061e+01 3.496167e+01 1.000000e+00 3.000000e+01 3.000000e+01 3.000000e+01 1.124000e+03
maximum_minimum_nights 36322.0 3.843604e+01 6.464567e+01 1.000000e+00 3.000000e+01 3.000000e+01 3.000000e+01 1.124000e+03
minimum_maximum_nights 36322.0 2.968219e+05 2.519472e+07 1.000000e+00 3.600000e+02 3.650000e+02 1.125000e+03 2.147484e+09
maximum_maximum_nights 36322.0 8.880841e+05 4.363228e+07 1.000000e+00 3.650000e+02 7.300000e+02 1.125000e+03 2.147484e+09
minimum_nights_avg_ntm 36322.0 3.000185e+01 3.757597e+01 1.000000e+00 3.000000e+01 3.000000e+01 3.000000e+01 1.124000e+03
maximum_nights_avg_ntm 36322.0 5.537397e+05 3.055266e+07 1.000000e+00 3.650000e+02 5.791000e+02 1.125000e+03 2.147484e+09
calendar_updated 0.0 NaN NaN NaN NaN NaN NaN NaN
availability_30 36322.0 8.725125e+00 1.180300e+01 0.000000e+00 0.000000e+00 0.000000e+00 1.800000e+01 3.000000e+01
availability_60 36322.0 2.072482e+01 2.406532e+01 0.000000e+00 0.000000e+00 7.000000e+00 4.600000e+01 6.000000e+01
availability_90 36322.0 3.509735e+01 3.606974e+01 0.000000e+00 0.000000e+00 2.600000e+01 7.500000e+01 9.000000e+01
availability_365 36322.0 1.590105e+02 1.455351e+02 0.000000e+00 0.000000e+00 1.530000e+02 3.100000e+02 3.650000e+02
number_of_reviews 36322.0 2.685304e+01 6.738206e+01 0.000000e+00 0.000000e+00 3.000000e+00 2.300000e+01 3.277000e+03
number_of_reviews_ltm 36322.0 4.017262e+00 2.008151e+01 0.000000e+00 0.000000e+00 0.000000e+00 1.000000e+00 1.786000e+03
number_of_reviews_l30d 36322.0 3.030395e-01 1.483196e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 1.530000e+02
availability_eoy 36322.0 8.645298e+01 8.021589e+01 0.000000e+00 0.000000e+00 8.500000e+01 1.690000e+02 1.980000e+02
number_of_reviews_ly 36322.0 3.664886e+00 1.911530e+01 0.000000e+00 0.000000e+00 0.000000e+00 1.000000e+00 1.797000e+03
estimated_occupancy_l365d 36322.0 4.784547e+01 8.546655e+01 0.000000e+00 0.000000e+00 0.000000e+00 6.000000e+01 2.550000e+02
estimated_revenue_l365d 21459.0 1.523824e+04 9.763701e+04 0.000000e+00 0.000000e+00 0.000000e+00 1.851000e+04 1.275000e+07
review_scores_rating 25171.0 4.725706e+00 4.508312e-01 0.000000e+00 4.650000e+00 4.860000e+00 5.000000e+00 5.000000e+00
review_scores_accuracy 25161.0 4.765266e+00 4.377228e-01 0.000000e+00 4.710000e+00 4.900000e+00 5.000000e+00 5.000000e+00
review_scores_cleanliness 25170.0 4.658376e+00 4.999087e-01 0.000000e+00 4.540000e+00 4.810000e+00 5.000000e+00 5.000000e+00
review_scores_checkin 25157.0 4.835495e+00 3.758870e-01 0.000000e+00 4.820000e+00 4.950000e+00 5.000000e+00 5.000000e+00
review_scores_communication 25165.0 4.824200e+00 4.155085e-01 0.000000e+00 4.820000e+00 4.960000e+00 5.000000e+00 5.000000e+00
review_scores_location 25154.0 4.744684e+00 3.895678e-01 0.000000e+00 4.650000e+00 4.850000e+00 5.000000e+00 5.000000e+00
review_scores_value 25155.0 4.640968e+00 4.918495e-01 0.000000e+00 4.530000e+00 4.760000e+00 4.940000e+00 5.000000e+00
calculated_host_listings_count 36322.0 6.070415e+01 1.950884e+02 1.000000e+00 1.000000e+00 1.000000e+00 8.000000e+00 1.054000e+03
calculated_host_listings_count_entire_homes 36322.0 4.129800e+01 1.801254e+02 0.000000e+00 0.000000e+00 1.000000e+00 2.000000e+00 1.054000e+03
calculated_host_listings_count_private_rooms 36322.0 1.854292e+01 8.345715e+01 0.000000e+00 0.000000e+00 1.000000e+00 2.000000e+00 5.920000e+02
calculated_host_listings_count_shared_rooms 36322.0 3.317549e-02 5.371839e-01 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 1.700000e+01
reviews_per_month 25171.0 8.172615e-01 1.850794e+00 1.000000e-02 8.000000e-02 2.600000e-01 9.400000e-01 1.219700e+02
In [13]:
# New York: describe of the bathrooms column
data_NY["bathrooms"].describe()
Out[13]:
count    21682.000000
mean         1.188382
std          0.554439
min          0.000000
25%          1.000000
50%          1.000000
75%          1.000000
max         15.500000
Name: bathrooms, dtype: float64
In [14]:
# New York: describe of the availability_365 column
data_NY["availability_365"].describe()
Out[14]:
count    36322.000000
mean       159.010545
std        145.535113
min          0.000000
25%          0.000000
50%        153.000000
75%        310.000000
max        365.000000
Name: availability_365, dtype: float64
In [15]:
# New York: describe of the host_since column
data_NY["host_since"].describe()
Out[15]:
count          36307
unique          5159
top       2016-12-16
freq            1060
Name: host_since, dtype: object
In [16]:
# Convert date types: object -> datetime
if "host_since" in data_NY.columns:
    data_NY["host_since"] = pd.to_datetime(data_NY["host_since"], errors="coerce")

data_NY["host_since"].describe()
Out[16]:
count                            36307
mean     2017-03-13 15:15:35.522075904
min                2008-08-11 00:00:00
25%                2014-07-07 00:00:00
50%                2016-07-14 00:00:00
75%                2019-10-21 00:00:00
max                2025-06-10 00:00:00
Name: host_since, dtype: object

Second look - examples: availability, amenities, host¶

In [18]:
# A look at the data
# Read the New York amenities ("amenities") of a specific row via .loc
data_NY["amenities"].loc[3010]
Out[18]:
'["Pets allowed", "Essentials", "Kitchen", "First aid kit", "Smoke alarm", "TV with standard cable", "Elevator", "Wifi", "Carbon monoxide alarm", "Air conditioning", "Heating"]'
In [19]:
# A look at the data
# Full row of the listing whose amenities are shown above
data_NY.loc[3010]
Out[19]:
id                                                                           6179500
listing_url                                     https://www.airbnb.com/rooms/6179500
scrape_id                                                             20250617032754
last_scraped                                                              2025-06-18
source                                                               previous scrape
                                                                ...                 
calculated_host_listings_count                                                     1
calculated_host_listings_count_entire_homes                                        0
calculated_host_listings_count_private_rooms                                       1
calculated_host_listings_count_shared_rooms                                        0
reviews_per_month                                                               0.01
Name: 3010, Length: 79, dtype: object
In [20]:
# A look at the data
# "host_since" is already datetime64 after the conversion above
data_NY["host_since"].head(1)
Out[20]:
0   2008-09-07
Name: host_since, dtype: datetime64[ns]
In [21]:
# A look at the data
if "host_since" in data_NY.columns:
    data_NY["host_since"] = pd.to_datetime(data_NY["host_since"], errors="coerce")
data_NY["host_since"].head(10)
Out[21]:
0   2008-09-07
1   2008-09-09
2   2009-02-03
3   2009-05-06
4   2009-05-07
5   2009-05-12
6   2009-05-15
7   2009-05-17
8   2010-08-17
9   2010-05-20
Name: host_since, dtype: datetime64[ns]
In [22]:
# A look at the data
# Hosts whose "host_since" date is "2016-05-20"
data_NY.loc[data_NY["host_since"]=="2016-05-20"][["name", "host_since"]]
Out[22]:
name host_since
6229 On Fifth, Across From Central Park 2016-05-20
6256 2br 1 bath FURN kit MIN 1 month avail Feb1-28 2016-05-20
6257 STUNNING ONE BEDROOM IN THE HEART OF NEW YORK ... 2016-05-20
8822 Spacious Brooklyn Bedroom Close to Manhattan e... 2016-05-20
11187 Midtown West 2016-05-20
12427 2 Bed / 2 Bath Apartment in Beautiful Bed-Stuy 2016-05-20
14707 2 BR, 2BA in prime brownstone Brooklyn location 2016-05-20
15318 Boho Chic in The Bronx / The Green Room 2016-05-20
16719 Boho Chic in The Bronx / The Gold Room 2016-05-20
18912 Boho Chic in The Bronx/The Blue Room 2016-05-20
23514 Luxe Brooklyn Suite 2016-05-20
In [23]:
# A look at the data
# Top 5 highest availability (availability_365)
data_5first_availability_365 = data_NY.sort_values(by=["availability_365"], ascending=False).head(5)
data_5first_availability_365[["name", "availability_365", "host_since", "neighbourhood_group_cleansed"]]
Out[23]:
name availability_365 host_since neighbourhood_group_cleansed
22532 NYC Astoria-LaGuardia #4 TinyPrivateRoom Singl... 365 2019-10-17 Queens
25501 Luxury & design Townhouse - NY 365 2019-10-17 Brooklyn
15253 Renovated Apartment with high ceilings! 365 2020-01-23 Brooklyn
25445 Large Room in Queens, 15 mins to Manhattan #485 365 2012-08-11 Queens
25446 Queens New HDTV Room, 15 min to Manhattan #486 365 2012-08-11 Queens
In [24]:
# A look at the data
# Bottom 5: lowest availability (availability_365)
data_5last_availability_365 = data_NY.sort_values(by=["availability_365"], ascending=False).tail(5)
data_5last_availability_365[["name", "availability_365", "host_since", "neighbourhood_group_cleansed"]]
Out[24]:
name availability_365 host_since neighbourhood_group_cleansed
11578 BEAUTIFUL Water View Aparment in Financial 0 2018-10-31 Manhattan
11577 Parkway Abode 0 2013-12-27 Brooklyn
11575 QUIET Zen Garden Pad in BEST Brooklyn location! 0 2010-12-21 Brooklyn
11573 Cozy NYC Downtown Apartment 0 2014-03-21 Manhattan
8769 Penthouse Apt with Incredible Views & Location 0 2014-08-30 Manhattan
In [25]:
# A look at the data
# Top 5 "availability_365" with style.bar

data_5first_availability_365[["name", "availability_365", "host_since", "neighbourhood_group_cleansed"]].style.bar(subset=["availability_365"])
Out[25]:
  name availability_365 host_since neighbourhood_group_cleansed
22532 NYC Astoria-LaGuardia #4 TinyPrivateRoom SingleBed 365 2019-10-17 00:00:00 Queens
25501 Luxury & design Townhouse - NY 365 2019-10-17 00:00:00 Brooklyn
15253 Renovated Apartment with high ceilings! 365 2020-01-23 00:00:00 Brooklyn
25445 Large Room in Queens, 15 mins to Manhattan #485 365 2012-08-11 00:00:00 Queens
25446 Queens New HDTV Room, 15 min to Manhattan #486 365 2012-08-11 00:00:00 Queens
In [26]:
# A look at the data

# Total number of reviews
print("number_of_reviews: ", data_NY["number_of_reviews"].sum())

# Number of distinct property types
print("property_type: ", len(data_NY["property_type"].unique()))

# Number of distinct boroughs (neighbourhood_group_cleansed)
print("neighbourhood_group_cleansed: ", len(data_NY["neighbourhood_group_cleansed"].unique()))
number_of_reviews:  975356
property_type:  74
neighbourhood_group_cleansed:  5
In [27]:
# Investigate Room Type
data_NY["room_type"].unique()
Out[27]:
array(['Private room', 'Entire home/apt', 'Hotel room', 'Shared room'],
      dtype=object)

Descriptive statistics:¶

In this part we turn to the following core topics:

  • Drop columns containing pure URLs, images, host name, host ID, host location and description
  • Convert the "date" column and the boolean columns
  • Handle null values
In [29]:
# Drop columns containing pure URLs, images, host name, host ID, host location, description

# These columns hold information from the last scrape:
# calculated_host_listings_count 
# calculated_host_listings_count_entire_homes 
# calculated_host_listings_count_private_rooms 
# calculated_host_listings_count_shared_rooms 

# Column "host_name": name of the host, usually just the first name(s)

# Columns "host_listings_count", "host_total_listings_count": the number of listings the host has (per Airbnb, calculation unknown)

# Columns not needed for the analysis:
drop_columns = ["listing_url", 
                "scrape_id", 
                "last_scraped", 
                "source",
                "description",
                "neighborhood_overview", 
                "picture_url",
                "host_url",
                "host_name",
                "host_about",
                "host_thumbnail_url",
                "host_picture_url",
                "host_listings_count",
                "host_total_listings_count",
                "host_has_profile_pic",
                "calendar_last_scraped",
                "calculated_host_listings_count",
                "calculated_host_listings_count_entire_homes",
                "calculated_host_listings_count_private_rooms",
                "calculated_host_listings_count_shared_rooms"
                ]

df_NY = data_NY.drop(columns=drop_columns, axis=1, errors="ignore")
In [30]:
print(df_NY[["host_identity_verified","instant_bookable","has_availability"]])

# Convert t/f flags (and real bools) to pandas nullable booleans
def to_bool(x):
    if pd.isna(x):
        return pd.NA
    s = str(x).strip().lower()
    if s in ("t", "true"):
        return True
    if s in ("f", "false"):
        return False
    return pd.NA

col_bool = ["host_identity_verified", "instant_bookable", "has_availability"]

for c in col_bool:
    if c in df_NY.columns:
        df_NY[c] = df_NY[c].map(to_bool).astype("boolean")
      host_identity_verified instant_bookable has_availability
0                          t                f                t
1                          t                f                t
2                          t                t                t
3                          t                f                t
4                          f                f                t
...                      ...              ...              ...
36317                      t                t                t
36318                      t                t                t
36319                      t                f                t
36320                      t                f                t
36321                      f                f                t

[36322 rows x 3 columns]
In [31]:
print(df_NY[["host_identity_verified","instant_bookable","has_availability"]].head())
   host_identity_verified  instant_bookable  has_availability
0                    True             False              True
1                    True             False              True
2                    True              True              True
3                    True             False              True
4                   False             False              True
In [32]:
# Column "license": values [nan, 'Exempt', "OSE-*"]
print(df_NY["license"].unique()[:5])
[i for i in df_NY["license"].unique() if str(i)[:4].upper() != "OSE-"]
[nan 'OSE-STRREG-0000008' 'OSE-STRREG-0000923' 'OSE-STRREG-0000656'
 'OSE-STRREG-0000108']
Out[32]:
[nan, 'Exempt']
In [33]:
# Check for missing values
df_NY.isna().any()
Out[33]:
id                              False
name                             True
host_id                         False
host_since                       True
host_location                    True
host_response_time               True
host_response_rate               True
host_acceptance_rate             True
host_is_superhost                True
host_neighbourhood               True
host_verifications               True
host_identity_verified           True
neighbourhood                    True
neighbourhood_cleansed          False
neighbourhood_group_cleansed    False
latitude                        False
longitude                       False
property_type                   False
room_type                       False
accommodates                    False
bathrooms                        True
bathrooms_text                   True
bedrooms                         True
beds                             True
amenities                       False
price                            True
minimum_nights                  False
maximum_nights                  False
minimum_minimum_nights          False
maximum_minimum_nights          False
minimum_maximum_nights          False
maximum_maximum_nights          False
minimum_nights_avg_ntm          False
maximum_nights_avg_ntm          False
calendar_updated                 True
has_availability                 True
availability_30                 False
availability_60                 False
availability_90                 False
availability_365                False
number_of_reviews               False
number_of_reviews_ltm           False
number_of_reviews_l30d          False
availability_eoy                False
number_of_reviews_ly            False
estimated_occupancy_l365d       False
estimated_revenue_l365d          True
first_review                     True
last_review                      True
review_scores_rating             True
review_scores_accuracy           True
review_scores_cleanliness        True
review_scores_checkin            True
review_scores_communication      True
review_scores_location           True
review_scores_value              True
license                          True
instant_bookable                False
reviews_per_month                True
dtype: bool
In [34]:
# Helper functions

# Missing values per column, absolute and as a percentage
def missing_values(df):
    missing = df.isnull().sum().to_frame(name="Missing_values")
    missing["Total_perc"] = (missing["Missing_values"] / len(df)) * 100
    return missing
In [35]:
missing_NY = missing_values(df_NY)
missing_NY[missing_NY["Total_perc"] > 0].Total_perc
Out[35]:
name                             0.005506
host_since                       0.041297
host_location                   20.398106
host_response_time              40.669567
host_response_rate              40.669567
host_acceptance_rate            40.201531
host_is_superhost                1.354551
host_neighbourhood              20.064974
host_verifications               0.041297
host_identity_verified           0.041297
neighbourhood                   47.456087
bathrooms                       40.306151
bathrooms_text                   0.099113
bedrooms                        16.651065
beds                            40.476846
price                           40.920104
calendar_updated               100.000000
has_availability                15.464457
estimated_revenue_l365d         40.920104
first_review                    30.700402
last_review                     30.700402
review_scores_rating            30.700402
review_scores_accuracy          30.727933
review_scores_cleanliness       30.703155
review_scores_checkin           30.738946
review_scores_communication     30.716921
review_scores_location          30.747206
review_scores_value             30.744452
license                         85.328451
reviews_per_month               30.700402
Name: Total_perc, dtype: float64
In [36]:
# Spalte: "calendar_updated" Missing-Rate 100%
df_NY["calendar_updated"].unique()

# Spalte "neighbourhood" has nur zwei Werte: ['Neighborhood highlights', nan]
df_NY["neighbourhood"].unique()

# Missing-Rate > 40% and no description in the data Doctionary:
# host_is_superhost 
# host_response_time
# host_response_rate
Out[36]:
array(['Neighborhood highlights', nan], dtype=object)
In [37]:
drop_columns = ["host_response_time",
                "host_response_rate",
                "host_is_superhost",
                "neighbourhood",
                "calendar_updated",
                ]

df_NY = df_NY.drop(columns=drop_columns, axis=1, errors="ignore")
# print(df_NY.info())
In [38]:
# Median
df_NY.median(numeric_only=True)
Out[38]:
id                             50049775.5
host_id                        83364401.0
host_identity_verified                1.0
latitude                        40.725941
longitude                      -73.954711
accommodates                          2.0
bathrooms                             1.0
bedrooms                              1.0
beds                                  1.0
minimum_nights                       30.0
maximum_nights                      365.0
minimum_minimum_nights               30.0
maximum_minimum_nights               30.0
minimum_maximum_nights              365.0
maximum_maximum_nights              730.0
minimum_nights_avg_ntm               30.0
maximum_nights_avg_ntm              579.1
has_availability                      1.0
availability_30                       0.0
availability_60                       7.0
availability_90                      26.0
availability_365                    153.0
number_of_reviews                     3.0
number_of_reviews_ltm                 0.0
number_of_reviews_l30d                0.0
availability_eoy                     85.0
number_of_reviews_ly                  0.0
estimated_occupancy_l365d             0.0
estimated_revenue_l365d               0.0
review_scores_rating                 4.86
review_scores_accuracy                4.9
review_scores_cleanliness            4.81
review_scores_checkin                4.95
review_scores_communication          4.96
review_scores_location               4.85
review_scores_value                  4.76
instant_bookable                      0.0
reviews_per_month                    0.26
dtype: Float64
In [39]:
# Standard Deviation
df_NY.std(numeric_only=True)
Out[39]:
id                             519910010100084032.0
host_id                            189370891.502614
host_identity_verified                     0.332515
latitude                                   0.056482
longitude                                  0.055252
accommodates                               1.891702
bathrooms                                  0.554439
bedrooms                                   0.938192
beds                                       1.193949
minimum_nights                            35.111004
maximum_nights                      11268434.763247
minimum_minimum_nights                    34.961666
maximum_minimum_nights                    64.645669
minimum_maximum_nights              25194723.611818
maximum_maximum_nights              43632280.118655
minimum_nights_avg_ntm                    37.575971
maximum_nights_avg_ntm              30552660.280436
has_availability                                0.0
availability_30                           11.803004
availability_60                           24.065318
availability_90                           36.069741
availability_365                         145.535113
number_of_reviews                         67.382057
number_of_reviews_ltm                     20.081506
number_of_reviews_l30d                     1.483196
availability_eoy                          80.215889
number_of_reviews_ly                      19.115302
estimated_occupancy_l365d                  85.46655
estimated_revenue_l365d                97637.007729
review_scores_rating                       0.450831
review_scores_accuracy                     0.437723
review_scores_cleanliness                  0.499909
review_scores_checkin                      0.375887
review_scores_communication                0.415509
review_scores_location                     0.389568
review_scores_value                         0.49185
instant_bookable                           0.398836
reviews_per_month                          1.850794
dtype: Float64
In [40]:
# Mean of "accommodates"
df_NY["accommodates"].mean().round(2)
Out[40]:
np.float64(2.75)
In [41]:
# Mean of "minimum_nights" and "maximum_nights"
df_NY["minimum_nights"].mean().round(2), df_NY["maximum_nights"].mean().round(2) 
Out[41]:
(np.float64(29.21), np.float64(60243.92))
In [42]:
# IQR (interquartile range)
num = df_NY.select_dtypes(include=[np.floating, np.integer])
# num.quantile(0.50, numeric_only=True)
num.quantile(0.75) - num.quantile(0.25)
Out[42]:
id                             9.277467e+17
host_id                        2.859807e+08
latitude                       7.406750e-02
longitude                      5.540750e-02
accommodates                   2.000000e+00
bathrooms                      0.000000e+00
bedrooms                       1.000000e+00
beds                           1.000000e+00
minimum_nights                 0.000000e+00
maximum_nights                 1.005000e+03
minimum_minimum_nights         0.000000e+00
maximum_minimum_nights         0.000000e+00
minimum_maximum_nights         7.650000e+02
maximum_maximum_nights         7.600000e+02
minimum_nights_avg_ntm         0.000000e+00
maximum_nights_avg_ntm         7.600000e+02
availability_30                1.800000e+01
availability_60                4.600000e+01
availability_90                7.500000e+01
availability_365               3.100000e+02
number_of_reviews              2.300000e+01
number_of_reviews_ltm          1.000000e+00
number_of_reviews_l30d         0.000000e+00
availability_eoy               1.690000e+02
number_of_reviews_ly           1.000000e+00
estimated_occupancy_l365d      6.000000e+01
estimated_revenue_l365d        1.851000e+04
review_scores_rating           3.500000e-01
review_scores_accuracy         2.900000e-01
review_scores_cleanliness      4.600000e-01
review_scores_checkin          1.800000e-01
review_scores_communication    1.800000e-01
review_scores_location         3.500000e-01
review_scores_value            4.100000e-01
reviews_per_month              8.600000e-01
dtype: float64
In [43]:
df_NY.describe(include=['object', 'category'])
Out[43]:
name host_location host_acceptance_rate host_neighbourhood host_verifications neighbourhood_cleansed neighbourhood_group_cleansed property_type room_type bathrooms_text amenities price first_review last_review license
count 36320 28913 21720 29034 36307 36322 36322 36322 36322 36286 36322 21459 25171 25171 5329
unique 34732 985 100 535 7 223 5 74 4 33 29619 1101 4416 3322 1963
top Water View King Bed Hotel Room New York, NY 100% Bedford-Stuyvesant ['email', 'phone'] Bedford-Stuyvesant Manhattan Entire rental unit Entire home/apt 1 bath ["Washer", "Kitchen", "Smoke alarm", "TV", "Wi... $150.00 2023-01-01 2025-05-31 Exempt
freq 30 21784 5681 1900 28107 2615 16081 15335 19435 18449 225 318 61 410 3052

Missing data - null values¶

Using isnull().sum() we see that almost all columns contain null values:

In [45]:
df_NY.isnull().sum()
Out[45]:
id                                  0
name                                2
host_id                             0
host_since                         15
host_location                    7409
host_acceptance_rate            14602
host_neighbourhood               7288
host_verifications                 15
host_identity_verified             15
neighbourhood_cleansed              0
neighbourhood_group_cleansed        0
latitude                            0
longitude                           0
property_type                       0
room_type                           0
accommodates                        0
bathrooms                       14640
bathrooms_text                     36
bedrooms                         6048
beds                            14702
amenities                           0
price                           14863
minimum_nights                      0
maximum_nights                      0
minimum_minimum_nights              0
maximum_minimum_nights              0
minimum_maximum_nights              0
maximum_maximum_nights              0
minimum_nights_avg_ntm              0
maximum_nights_avg_ntm              0
has_availability                 5617
availability_30                     0
availability_60                     0
availability_90                     0
availability_365                    0
number_of_reviews                   0
number_of_reviews_ltm               0
number_of_reviews_l30d              0
availability_eoy                    0
number_of_reviews_ly                0
estimated_occupancy_l365d           0
estimated_revenue_l365d         14863
first_review                    11151
last_review                     11151
review_scores_rating            11151
review_scores_accuracy          11161
review_scores_cleanliness       11152
review_scores_checkin           11165
review_scores_communication     11157
review_scores_location          11168
review_scores_value             11167
license                         30993
instant_bookable                    0
reviews_per_month               11151
dtype: int64
In [46]:
# Number of distinct rows: 36322
# -> no duplicates
df_NY.drop_duplicates().shape[0]
Out[46]:
36322

Clean the columns Price, Bathrooms, has_availability¶

Clean the column "price" and fill missing values¶

Missing values: 40.9%

Goal: convert the column price cleanly to float, remove outliers, and impute missing values hierarchically with medians.

Approach:

  • Clean price: convert to float; remove currency symbols and thousands separators.
  • Implausible values: price <= 0 or very high (e.g. > 99.5% quantile) → NaN.
  • Check before imputation: inspect distributions/key figures before filling.
  • Imputation: hierarchical median.
  • Check after imputation: inspect distributions/key figures after filling.
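The fallback logic of a hierarchical median imputation can be illustrated on a tiny toy frame (invented data, not the NY listings): each pass fills only rows that are still NaN, starting with the finest grouping.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "room_type": ["Entire", "Entire", "Entire", "Private", "Private"],
    "accommodates": [2, 2, 4, 1, 1],
    "price": [100.0, np.nan, 200.0, np.nan, 50.0],
})

# From fine to coarse: at each level, fill only rows that are still NaN
for cols in [["room_type", "accommodates"], ["room_type"]]:
    med = toy.groupby(cols)["price"].transform("median")
    need = toy["price"].isna() & med.notna()
    toy.loc[need, "price"] = med[need]

print(toy["price"].tolist())  # [100.0, 100.0, 200.0, 50.0, 50.0]
```

Row 2 is filled from the (Entire, 2) group median, row 4 from the (Private, 1) group median; the coarser pass would only matter for rows whose fine group has no observed price at all.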
In [48]:
print("Missing price before imputation:", df_NY["price"].isna().sum())
Missing price before imputation: 14863
In [49]:
# Investigate "price" column

# df_NY[df_NY["price"].isna() == True].loc[24]
df_NY["price"].loc[20:26]

# Collect the first character (the currency symbol) of every non-null price
def find_currency(df):
    return {str(i)[0] for i in df["price"].dropna().unique()}

find_currency(df_NY)
Out[49]:
{'$'}
In [50]:
# Clean price: object -> float
df_NY["price"] = (df_NY["price"].astype(str)
                 .str.replace(r"[^\d.\-]", "", regex=True)
                 .replace({"": np.nan}))
df_NY["price"] = pd.to_numeric(df_NY["price"], errors="coerce")
In [51]:
# Obvious errors -> NaN

df_NY.loc[df_NY["price"] <= 0, "price"] = np.nan
upper_cap = df_NY["price"].quantile(0.995)  # very extreme spikes

# Treat extreme values as NaN
df_NY.loc[df_NY["price"] > upper_cap, "price"] = np.nan
In [52]:
# Hierarchical median imputation (fine-grained)
groupings = [
    ["neighbourhood_group_cleansed","room_type","accommodates"],
    ["neighbourhood_group_cleansed","room_type"],
    ["room_type","accommodates"],
    ["room_type"],
]
for cols in groupings:
    cols = [c for c in cols if c in df_NY.columns]
    if not cols: 
        continue
    med = df_NY.groupby(cols)["price"].transform("median")
    need = df_NY["price"].isna() & med.notna()
    df_NY.loc[need, "price"] = med[need]

# Fallback: global median (coarse)
# if df_NY["price"].isna().any():
#     df_NY["price"] = df_NY["price"].fillna(df_NY["price"].median())

# Check
print("Missing price after imputation:", df_NY["price"].isna().sum())
Missing price after imputation: 0

Clean the column "bathrooms" and fill missing values from "bathrooms_text"¶

Missing values: 40.3%

Goal: complete the numeric column bathrooms (float) by deriving missing values from the textual column bathrooms_text. Both fields describe the same information.

Approach:

  1. Parsing rules:
    • A number directly before bath/bathrooms (e.g. 1 bath, 1.5 baths, 2 bathrooms, 0 baths).
    • half-bath without a number equals 0.5.
    • Qualifiers such as private/shared are ignored.
  2. Filling: only fill where bathrooms is NaN – existing numeric values are not overwritten.
  3. Check: how many rows were filled, how many are still missing.
In [54]:
df_NY[df_NY["bathrooms"].notna()][["bathrooms", "bathrooms_text"]].head(10)
Out[54]:
bathrooms bathrooms_text
0 1.0 1 private bath
1 1.0 1 bath
2 1.5 1.5 baths
3 1.0 1 bath
5 1.0 1 shared bath
7 1.0 1 private bath
8 1.0 1 bath
11 1.0 1 bath
14 1.0 1 bath
16 1.5 1.5 shared baths
In [55]:
# Investigate
df_NY["bathrooms"].unique()
Out[55]:
array([ 1. ,  1.5,  nan,  2. ,  2.5,  3. ,  5. ,  0. ,  0.5,  3.5, 15.5,
       10.5,  4. ,  4.5,  5.5,  6. ,  7. ,  7.5,  9. ,  6.5])
In [56]:
# Investigate
df_NY["bathrooms_text"].unique()
Out[56]:
array(['1 private bath', '1 bath', '1.5 baths', '1 shared bath',
       '1.5 shared baths', '2 baths', '2.5 baths', nan, '3 baths',
       '5 baths', '0 shared baths', '2 shared baths', 'Shared half-bath',
       '3.5 baths', '2.5 shared baths', 'Half-bath', 'Private half-bath',
       '0 baths', '15.5 baths', '3 shared baths', '10.5 baths', '4 baths',
       '4.5 baths', '3.5 shared baths', '4 shared baths',
       '4.5 shared baths', '5.5 baths', '6 baths', '7 baths',
       '6 shared baths', '7.5 baths', '9 baths', '5 shared baths',
       '6.5 baths'], dtype=object)
In [57]:
# Investigate: "bathrooms", "bathrooms_text"
df_bathrooms_null_bathrooms_text = df_NY[df_NY["bathrooms"].isnull()][["bathrooms", "bathrooms_text"]]

df_bathrooms_null_bathrooms_text.head(10)
# df_bathrooms_null_bathrooms_text.index.values

idx = df_bathrooms_null_bathrooms_text.index.values
In [58]:
# fill df_NY["bathrooms"] from df_NY["bathrooms_text"] only where it’s missing 
# (handles “1 bath”, “1.5 baths”, “Half-bath”, “shared half-bath”, “2 bathrooms”, etc.):
 
pattern = re.compile(r"""(?ix)
    ^\s*                                                          # allow leading spaces
    (?:                                                           # start alternation
        (?P<num>\d+(?:\.\d+)?)\s*                                 # 1, 1.5, 0, 15.5
        (?: (?:private|shared)\s+ )?                              # optional qualifier after number
        bath(?:room)?s?                                           # bath / bathroom(s)
      |
        (?: (?:private|shared)\s+ )?                              # optional qualifier before 'half'
        half[-\s]?bath(?:room)?s?                                 # half-bath / half bathroom(s)
    )
    \s*$                                                          # allow trailing spaces
""")

def parse_bath(text: str):
    """Parse bath count from bathrooms_text (your listed formats)."""
    if not isinstance(text, str) or not text.strip():
        return np.nan
    m = pattern.match(text.strip())
    if not m:
        return np.nan
    if m.group('num') is not None:        # numeric forms: '1 bath', '1.5 shared baths', '0 baths'
        return float(m.group('num'))
    else:                                  # 'half-bath', 'shared half-bath', 'private half-bath'
        return 0.5

# fill only missing bathrooms from parsed bathrooms_text
df_NY["bathrooms"] = df_NY["bathrooms"].fillna(df_NY["bathrooms_text"].apply(parse_bath))
In [59]:
# Missing values
df_NY["bathrooms"].isna().sum()
Out[59]:
np.int64(25)
In [60]:
df_NY["bathrooms"].unique()
Out[60]:
array([ 1. ,  1.5,  2. ,  2.5,  nan,  3. ,  5. ,  0. ,  0.5,  3.5, 15.5,
       10.5,  4. ,  4.5,  5.5,  6. ,  7. ,  7.5,  9. ,  6.5])
In [61]:
df_bathrooms_null_bathrooms_text_filled = df_NY[df_NY["bathrooms"].isnull()][["bathrooms", "bathrooms_text"]]

len(df_bathrooms_null_bathrooms_text_filled)
Out[61]:
25
In [62]:
df_bathrooms_null_bathrooms_text_filled.index.values
Out[62]:
array([   59,   389,   443,   457,   469,   828,   849,  2347,  2503,
        3018,  5822, 14940, 17366, 17367, 17381, 20029, 20054, 20057,
       28437, 30861, 33572, 34513, 34973, 35625, 35811])
In [63]:
df_NY = df_NY.drop("bathrooms_text", axis=1)
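The 25 rows without any bathroom information stay NaN here. If they needed to be filled as well, one option (an assumption, not done above) would be a per-room_type median fallback, shown on toy data:

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_NY (invented values)
df = pd.DataFrame({
    "room_type": ["Entire home/apt", "Entire home/apt", "Private room", "Private room"],
    "bathrooms": [2.0, np.nan, 1.0, np.nan],
})

# Median bathrooms per room type, used only where the value is missing
med = df.groupby("room_type")["bathrooms"].transform("median")
df["bathrooms"] = df["bathrooms"].fillna(med)
# df["bathrooms"] -> [2.0, 2.0, 1.0, 1.0]
```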

Investigate the columns "host_neighbourhood" and "host_location"¶

Missing values > 20%

Goal: assess the columns host_neighbourhood and host_location.

Approach:

  • Compare host_neighbourhood with neighbourhood_cleansed.
  • Drop host_neighbourhood.
  • Compare host_location with latitude and longitude.
  • Drop host_location.
In [64]:
# Investigate: no description in the data dictionary
df_NY["host_neighbourhood"].unique()[:10]

# array(['Gravesend', 'Midtown', 'Greenwood Heights', 'Williamsburg',
#        'East Harlem', 'Fort Greene', 'Ridgewood', 'Alphabet City',
#        'Greenwich Village', 'Harlem', ...], dtype=object)
Out[64]:
array(['Gravesend', 'Midtown', 'Greenwood Heights', 'Williamsburg',
       'East Harlem', 'Fort Greene', 'Ridgewood', 'Alphabet City',
       'Greenwich Village', 'Harlem'], dtype=object)
In [65]:
df_NY[df_NY["host_neighbourhood"].notnull()][["host_neighbourhood", "neighbourhood_cleansed"]].head(10)
Out[65]:
host_neighbourhood neighbourhood_cleansed
0 Gravesend Kensington
1 Midtown Midtown
2 Greenwood Heights Sunset Park
3 Williamsburg Williamsburg
4 East Harlem East Harlem
5 East Harlem East Harlem
6 Williamsburg Williamsburg
7 Fort Greene Fort Greene
8 Ridgewood Ridgewood
9 Alphabet City East Village
In [66]:
len(df_NY[df_NY["host_neighbourhood"] == df_NY["neighbourhood_cleansed"]][["host_neighbourhood", "neighbourhood_cleansed"]])
Out[66]:
16173
In [67]:
len(df_NY[df_NY["host_neighbourhood"] != df_NY["neighbourhood_cleansed"]][["host_neighbourhood", "neighbourhood_cleansed"]])
Out[67]:
20149
In [68]:
# Drop "host_neighbourhood": the column "neighbourhood_cleansed" already holds the correct location
df_NY = df_NY.drop("host_neighbourhood", axis=1)
In [69]:
# Description in the data dictionary: the host's self-reported location
df_NY["host_location"].unique()[:10]

# array(['New York, NY', 'Woodstock, NY', ...])
Out[69]:
array(['New York, NY', 'Woodstock, NY', 'Las Vegas, NV',
       'New York, United States', 'Great Neck, NY', 'Berkeley, CA',
       'Los Angeles, CA', nan, 'Montreal, Canada', 'United States'],
      dtype=object)
In [70]:
# Compare the column "host_location" with "latitude" and "longitude"
df_NY[df_NY["host_location"].notnull()][["host_location", "latitude", "longitude"]].head(10)
Out[70]:
host_location latitude longitude
0 New York, NY 40.64529 -73.97238
1 Woodstock, NY 40.75356 -73.98559
2 New York, NY 40.66265 -73.99454
3 New York, NY 40.70935 -73.95342
4 New York, NY 40.80107 -73.94255
5 New York, NY 40.78778 -73.94759
6 New York, NY 40.71248 -73.95881
7 New York, NY 40.69194 -73.97389
8 New York, NY 40.70271 -73.89930
9 New York, NY 40.72530 -73.98028
In [71]:
# Investigate the column "host_location"
# Convert type df_NY["host_location"] to string
df_NY["host_location"] = df_NY["host_location"].astype("string") 

df_NY[df_NY["host_location"] == "Calgary, Canada"][["host_location", "latitude", "longitude"]]
#         host_location	latitude	longitude
# 34152	Calgary, Canada	40.678622	-73.946671
# BUT IT'S NY !!!!

df_NY[df_NY["host_location"] == "Vila Velha, Brazil"][["host_location", "latitude", "longitude"]]
#         host_location	latitude	longitude
# 35999	Vila Velha, Brazil	40.730115	-73.982319
# BUT IT'S NY !!!!
Out[71]:
host_location latitude longitude
35999 Vila Velha, Brazil 40.730115 -73.982319
In [72]:
# Drop "host_location": self-reported and often inconsistent with the actual coordinates
df_NY = df_NY.drop("host_location", axis=1)

Spalte "has_availability" untersuchen und fehlende Werte aus "availability_365" füllen¶

Fehlende Werte 15%

Ziel: Die boolesche Spalte has_availability vervollständigen, indem wir fehlende Werte aus der Spalte availability_365 ableiten.

Vorgehen:

  1. Füllen: Nur dort ergänzen, wo has_availability NaN ist – existierende boolesche Werte werden nicht überschrieben.
  2. Check: Wie viele Zeilen wurden gefüllt, wie viele fehlen noch?
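Die Füll-Logik lässt sich an einem hypothetischen Mini-Beispiel skizzieren (nur NaN wird ersetzt, vorhandene Werte bleiben unberührt):

```python
import pandas as pd

# Hypothetische Mini-Daten mit nullable-boolean Spalte
df = pd.DataFrame({
    "has_availability": pd.array([True, None, None], dtype="boolean"),
    "availability_365": [120, 0, 45],
})

# NaN ableiten: True, wenn mindestens 1 verfügbarer Tag in den nächsten 365, sonst False
df["has_availability"] = (
    df["has_availability"]
      .fillna(df["availability_365"].gt(0))
      .astype("boolean")
)
print(df["has_availability"].tolist())  # [True, False, True]
```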
In [73]:
print("Missing \"has_availability\" before:", df_NY["has_availability"].isna().sum())
Missing "has_availability" before: 5617
In [74]:
# Investigate "has_availability"
df_NY["has_availability"].unique()
# [True, <NA>]

df_NY[df_NY["has_availability"].isnull()][["has_availability", "availability_30", "availability_60", "availability_90", "availability_365"]]
Out[74]:
has_availability availability_30 availability_60 availability_90 availability_365
13 <NA> 0 0 0 0
35 <NA> 0 0 0 0
42 <NA> 0 0 0 0
59 <NA> 0 0 0 0
71 <NA> 0 0 0 0
... ... ... ... ... ...
36268 <NA> 28 58 88 88
36273 <NA> 6 21 51 141
36276 <NA> 30 60 90 365
36306 <NA> 16 46 66 66
36311 <NA> 10 31 31 31

5617 rows × 5 columns

In [75]:
# Investigate "has_availability"
mask = (
    df_NY["availability_30"].eq(0) &
    df_NY["availability_60"].eq(0) &
    df_NY["availability_90"].eq(0) &
    df_NY["availability_365"].eq(0) &
    df_NY["has_availability"].isna()
)

count = mask.sum()          # number of rows
rows  = df_NY.loc[mask]        # the matching rows
count
rows[["has_availability","availability_30","availability_60","availability_90","availability_365"]]
Out[75]:
has_availability availability_30 availability_60 availability_90 availability_365
13 <NA> 0 0 0 0
35 <NA> 0 0 0 0
42 <NA> 0 0 0 0
59 <NA> 0 0 0 0
71 <NA> 0 0 0 0
... ... ... ... ... ...
33888 <NA> 0 0 0 0
33982 <NA> 0 0 0 0
34117 <NA> 0 0 0 0
34294 <NA> 0 0 0 0
36088 <NA> 0 0 0 0

5348 rows × 5 columns

In [76]:
mask = (
    df_NY["has_availability"].isna() |
    (
        df_NY["availability_30"].eq(0) &
        df_NY["availability_60"].eq(0) &
        df_NY["availability_90"].eq(0) &
        df_NY["availability_365"].eq(0)
    )
)

count = mask.sum()          # number of rows
rows  = df_NY.loc[mask]        # the matching rows
count
rows[["has_availability","availability_30","availability_60","availability_90","availability_365"]]
Out[76]:
has_availability availability_30 availability_60 availability_90 availability_365
6 True 0 0 0 0
9 True 0 0 0 0
10 True 0 0 0 0
13 <NA> 0 0 0 0
15 True 0 0 0 0
... ... ... ... ... ...
36268 <NA> 28 58 88 88
36273 <NA> 6 21 51 141
36276 <NA> 30 60 90 365
36306 <NA> 16 46 66 66
36311 <NA> 10 31 31 31

12650 rows × 5 columns

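Hinweis zur Maske in der Zelle oben: `&` bindet in Python stärker als `|`, die Bedingung wird also als isna ODER (alle vier Verfügbarkeiten gleich 0) ausgewertet. Ein hypothetisches Mini-Beispiel zeigt, dass explizite Klammern das Ergebnis ändern können:

```python
import pandas as pd

s = pd.Series([0, 5, 5])
flag = pd.Series([False, True, False])

# Ohne Klammern: & wird zuerst ausgewertet -> flag | (s.eq(0) & s.lt(1))
loose = flag | s.eq(0) & s.lt(1)
# Mit Klammern um das |: eine ganz andere Bedingung
strict = (flag | s.eq(0)) & s.lt(1)

print(loose.tolist())   # [True, True, False]
print(strict.tolist())  # [True, False, False]
```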
In [77]:
# Clean df_NY["has_availability"]
# Minimal (use 365 days only)
# fill the 15% missing in has_availability: True if there is at least 1 available day in the next 365, else False
df_NY["has_availability"] = (
    df_NY["has_availability"]
      .fillna(df_NY["availability_365"].gt(0))
      .astype("boolean")
)
In [78]:
print("Missing \"has_availability\" after:", df_NY["has_availability"].isna().sum())
Missing "has_availability" after: 0
In [79]:
missing = missing_values(df_NY)
missing[missing["Total_perc"] > 0].Total_perc
Out[79]:
name                            0.005506
host_since                      0.041297
host_acceptance_rate           40.201531
host_verifications              0.041297
host_identity_verified          0.041297
bathrooms                       0.068829
bedrooms                       16.651065
beds                           40.476846
estimated_revenue_l365d        40.920104
first_review                   30.700402
last_review                    30.700402
review_scores_rating           30.700402
review_scores_accuracy         30.727933
review_scores_cleanliness      30.703155
review_scores_checkin          30.738946
review_scores_communication    30.716921
review_scores_location         30.747206
review_scores_value            30.744452
license                        85.328451
reviews_per_month              30.700402
Name: Total_perc, dtype: float64

Spalte "estimated_revenue_l365d", "host_acceptance_rate" und "beds" untersuchen¶

Fehlende Werte > 40%

Ziel: Die Spalten estimated_revenue_l365d, host_acceptance_rate und beds aufbereiten.

Vorgehen:

  • Daten: estimated_revenue_l365d droppen.
  • Daten: host_acceptance_rate droppen.
  • Daten: beds droppen.
In [81]:
# Investigate: not in the dictionary
len(df_NY["estimated_revenue_l365d"].unique())
Out[81]:
2442
In [82]:
# Drop "estimated_revenue_l365d"
df_NY = df_NY.drop("estimated_revenue_l365d", axis=1)
In [83]:
# Investigate 
# "bedrooms" missing values 16.651065
# "beds": missing values 40.476846
# mask = (
#     df_NY["beds"].eq(0) &
#     df_NY["bedrooms"].eq(0)
# )

# count = mask.sum()  
# count
# rows  = df_NY.loc[mask] 
# rows[["beds", "bedrooms"]]
# 60

# mask = (
#     df_NY["beds"].isnull() &
#     df_NY["bedrooms"].notnull()
# )
# mask = (
#     df_NY["beds"].notnull() &
#     df_NY["bedrooms"].notnull() &
#     df_NY["beds"] > df_NY["bedrooms"]
# )
# rows  = df_NY.loc[mask]
# len(rows)*100/len(df_NY) # 24 %

# Drop "beds": zu viele fehlende Werte (ca. 40 %)
df_NY = df_NY.drop("beds", axis=1)
In [84]:
# Not needed for the statistics: drop "host_acceptance_rate"
df_NY = df_NY.drop("host_acceptance_rate", axis=1)

Spalten "first_review" und "last_review" bereinigen und fehlende Werte ausfüllen¶

Fehlende Werte > 30%

Ziel: Die Datumsfelder first_review und last_review in datetime64[ns] konvertieren, Ausreißer (Zukunft/Extremwerte) entfernen und fehlende Werte mit dem globalen Median-Datum pro Spalte füllen.

Vorgehen:

  • Datumsformat: in datetime64[ns] umwandeln.
  • Offensichtliche Ausreißer:
    • Zukunftsdaten (> heute) -> NaT.
    • Sehr späte Werte (z. B. > 99,5%-Quantil) -> NaT.
    • Sehr frühe Werte (< 0,5%-Quantil) -> NaT.
  • Konsistenz: sicherstellen, dass first_review ≤ last_review; sonst inkonsistente Werte auf NaT.
  • Check vor Imputation: Verteilungen/Anteil Missing prüfen.
  • Imputation: globaler Median je Spalte (first_review, last_review) zum Füllen von NaT.
  • Check nach Imputation: Verteilungen/Kennzahlen erneut prüfen; Anteil imputierter Werte dokumentieren.
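Die Schritte Ausreißer-Maskierung und Median-Imputation lassen sich an einem hypothetischen Mini-Beispiel skizzieren:

```python
import pandas as pd

# Hypothetische Datums-Serie mit Zukunftswert und fehlendem Wert
s = pd.to_datetime(
    pd.Series(["2020-01-01", "2021-06-15", "2099-12-31", None]),
    errors="coerce",
)

s = s.mask(s > pd.Timestamp.today())  # Zukunftsdatum -> NaT
s = s.fillna(s.median())              # NaT mit Median-Datum füllen
print(s.isna().sum())                 # 0
```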
In [85]:
# Check missing values
date_cols = ["first_review", "last_review"]

for c in date_cols:
    print("Missing values before imputation:", df_NY[c].isna().sum())
Missing values before imputation: 11151
Missing values before imputation: 11151
In [86]:
# Datums-Extremwerte (einfach) -> NaT, dann mit Median füllen
date_cols = ["first_review", "last_review"]

for c in date_cols:
    s = pd.to_datetime(df_NY[c], errors="coerce")

    # Offensichtliche Ausreißer -> NaT
    s = s.mask(s > pd.Timestamp.today())      # Zukunftsdaten
    upper_cap = s.quantile(0.995)             # sehr späte Ausreißer (oberes 99,5%-Quantil)
    s = s.mask(s > upper_cap)

    # Sehr frühe Ausreißer
    lower_cap = s.quantile(0.005)
    s = s.mask(s < lower_cap)

    # Bereinigte Serie zurückschreiben (sonst gehen die Maskierungen verloren)
    df_NY[c] = s
In [89]:
# Konsistenz prüfen: first_review <= last_review, sonst konservativ auf NaT setzen
both = df_NY["first_review"].notna() & df_NY["last_review"].notna()
bad = both & (df_NY["first_review"] > df_NY["last_review"])
if bad.any():
    df_NY.loc[bad, ["first_review", "last_review"]] = pd.NaT        
In [90]:
# Missing mit globalem Median-Datum füllen
for c in ["first_review", "last_review"]:
    df_NY[c] = pd.to_datetime(df_NY[c], errors="coerce")  # -> datetime64[ns]
    med = df_NY[c].median()                            # Timestamp
    df_NY[c] = df_NY[c].fillna(med)                       # fill NaT with median
In [91]:
# Check missing values
for c in date_cols:
    print("Missing values after imputation:", df_NY[c].isna().sum())
Missing values after imputation: 0
Missing values after imputation: 0
In [92]:
missing = missing_values(df_NY)
missing[missing["Total_perc"] > 0].Total_perc
Out[92]:
name                            0.005506
host_since                      0.041297
host_verifications              0.041297
host_identity_verified          0.041297
bathrooms                       0.068829
bedrooms                       16.651065
review_scores_rating           30.700402
review_scores_accuracy         30.727933
review_scores_cleanliness      30.703155
review_scores_checkin          30.738946
review_scores_communication    30.716921
review_scores_location         30.747206
review_scores_value            30.744452
license                        85.328451
reviews_per_month              30.700402
Name: Total_perc, dtype: float64

Spalten "review_" und "reviews_" bereinigen und fehlende Werte ausfüllen¶

Fehlende Werte > 30%

Ziel: Alle numerischen Review-Spalten (z. B. review_scores_rating, review_scores_cleanliness, reviews_per_month) konsistent säubern, Ausreißer entfernen und fehlende Werte robust mit dem spaltenspezifischen Median füllen.

Auswahl der Spalten:

  • review_cols = [c for c in df_NY.columns if c.lower().startswith(("review_", "reviews_"))]

Vorgehen:

  • Typkonvertierung: ausgewählte Spalten in float konvertieren.
  • Offensichtliche Fehler: negative Werte -> NaN.
  • Ausreißer (hoch): Werte > 99,5%-Quantil pro Spalte -> NaN.
  • Ausreißer (niedrig): Werte < 0,5%-Quantil -> NaN.
  • Check vor Imputation: Verteilungen/Anteil Missing je Spalte prüfen.
  • Imputation: fehlende Werte spaltenweise mit dem Median füllen.
  • Check nach Imputation: Verteilungen und Anteil imputierter Werte dokumentieren; sicherstellen, dass keine negativen/außerhalb-Bereich-Werte verbleiben.
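Das Muster „Quantil-Kappung plus Median-Imputation“ als Skizze mit hypothetischen Werten (bei so wenigen Datenpunkten greifen die Quantilgrenzen aggressiver als im echten Datensatz):

```python
import pandas as pd

# Hypothetische Score-Spalte mit negativem Fehler, Ausreißer und fehlendem Wert
s = pd.to_numeric(pd.Series([4.8, 4.9, 5.0, -1.0, 100.0, None]), errors="coerce")

s = s.mask(s < 0)                  # negative Werte -> NaN
upper_cap = s.quantile(0.995)      # sehr hohe Spitzen kappen
s = s.mask(s > upper_cap)
lower_cap = s.quantile(0.005)      # sehr niedrige Ausreißer kappen
s = s.mask(s < lower_cap)

s = s.fillna(s.median())           # Rest mit Median füllen
print(s.isna().sum())              # 0
```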
In [93]:
# Check missing values
review_cols = [c for c in df_NY.columns if c.lower().startswith(("review_", "reviews_"))]

for c in review_cols:
    print("Missing values before imputation:", df_NY[c].isna().sum())
Missing values before imputation: 11151
Missing values before imputation: 11161
Missing values before imputation: 11152
Missing values before imputation: 11165
Missing values before imputation: 11157
Missing values before imputation: 11168
Missing values before imputation: 11167
Missing values before imputation: 11151
In [94]:
# Numerische Ausreißer -> NaN, dann mit Median füllen
for c in review_cols:
    # zu float konvertieren
    s = pd.to_numeric(df_NY[c], errors="coerce")

    # offensichtliche Fehler/Ausreißer -> NaN
    s = s.mask(s < 0)                 # negative Werte nicht zulässig
    upper_cap = s.quantile(0.995)     # sehr hohe Spitzen kappen
    s = s.mask(s > upper_cap)

    # Sehr niedrige Ausreißer kappen
    lower_cap = s.quantile(0.005)
    s = s.mask(s < lower_cap)

    # fehlende Werte mit Median füllen
    df_NY[c] = s.fillna(s.median())
In [95]:
# Check missing values
for c in review_cols:
    print("Missing values after imputation:", df_NY[c].isna().sum())
Missing values after imputation: 0
Missing values after imputation: 0
Missing values after imputation: 0
Missing values after imputation: 0
Missing values after imputation: 0
Missing values after imputation: 0
Missing values after imputation: 0
Missing values after imputation: 0

Spalten mit fehlenden Werten bereinigen und ausfüllen¶

  • name: trimmen von Leerzeichen; fehlend -> „Unbenannt“.
  • host_since: Zukunft und >99,5%-Quantil -> NaT; fehlend -> Median-Datum.
  • host_verifications: fehlend -> "[]"; trimmen.
  • host_identity_verified: fehlend -> False.
  • bathrooms: gruppierter Median nach room_type × accommodates, sonst global; ≥0 clippen, auf 0,5 runden.
  • bedrooms: gruppierter Median, sonst global; ≥0 clippen.
  • license: trotz sehr hoher Missing-Rate behalten; fehlend -> Platzhalter (der Lizenzstatus wird später visualisiert).
  • Report: verbleibende Missing je Spalte ausgeben.
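Die gruppierte Median-Imputation (room_type × accommodates) als hypothetische Mini-Skizze:

```python
import pandas as pd

# Hypothetische Mini-Daten: bathrooms fehlt in jeder Gruppe einmal
df = pd.DataFrame({
    "room_type":    ["Entire home", "Entire home", "Private room", "Private room"],
    "accommodates": [4, 4, 2, 2],
    "bathrooms":    [2.0, None, 1.0, None],
})

# Gruppen-Median auf alle Zeilen der Gruppe verteilen und nur NaN ersetzen
med_g = df.groupby(["room_type", "accommodates"])["bathrooms"].transform("median")
need = df["bathrooms"].isna()
df.loc[need, "bathrooms"] = med_g[need]

# Fallback: globaler Median, falls eine Gruppe komplett leer war
df["bathrooms"] = df["bathrooms"].fillna(df["bathrooms"].median())
print(df["bathrooms"].tolist())  # [2.0, 2.0, 1.0, 1.0]
```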
In [96]:
# Spalten mit fehlenden Werten
missing = missing_values(df_NY)
missing[missing["Total_perc"] > 0].Total_perc
Out[96]:
name                       0.005506
host_since                 0.041297
host_verifications         0.041297
host_identity_verified     0.041297
bathrooms                  0.068829
bedrooms                  16.651065
license                   85.328451
Name: Total_perc, dtype: float64
In [97]:
# Spalte: "name"
df_NY["name"] = df_NY["name"].str.strip().fillna("Unbenannt")
In [98]:
# Spalte: "license" (sehr hohe Missing-Rate 85%) – behalten, da der Lizenzstatus später visualisiert wird
df_NY["license"] = df_NY["license"].str.strip().fillna("Unbenannt")
In [99]:
# Spalte: "host_since"
s = pd.to_datetime(df_NY["host_since"], errors="coerce")  # -> datetime64[ns]
s = s.mask(s > pd.Timestamp.today())              # Zukunft -> NaT
upper_cap = s.quantile(0.995)                     # sehr späte Ausreißer
s = s.mask(s > upper_cap)
df_NY["host_since"] = s.fillna(s.median())        # globaler Median pro Spalte
In [100]:
# Spalte: "host_verifications"
df_NY["host_verifications"] = df_NY["host_verifications"].fillna("[]").str.strip()
In [101]:
# Spalte: "host_identity_verified"
df_NY["host_identity_verified"] = df_NY["host_identity_verified"].fillna(False)
In [102]:
# Spalte: "bathrooms"
grp_cols = [c for c in ["room_type","accommodates"] if c in df_NY.columns]
if grp_cols:
    med_g = df_NY.groupby(grp_cols)["bathrooms"].transform("median")
    need = df_NY["bathrooms"].isna()
    df_NY.loc[need, "bathrooms"] = med_g[need]
# Fallback global
df_NY["bathrooms"] = df_NY["bathrooms"].fillna(df_NY["bathrooms"].median())
# Aufräumen
df_NY["bathrooms"] = df_NY["bathrooms"].clip(lower=0)
df_NY["bathrooms"] = (np.round(df_NY["bathrooms"] * 2) / 2)  # 0.5-Schritte
In [103]:
# Spalte: "bedrooms"
grp_cols = [c for c in ["room_type","accommodates"] if c in df_NY.columns]
if grp_cols:
    med_g = df_NY.groupby(grp_cols)["bedrooms"].transform("median")
    need = df_NY["bedrooms"].isna()
    df_NY.loc[need, "bedrooms"] = med_g[need]
df_NY["bedrooms"] = df_NY["bedrooms"].fillna(df_NY["bedrooms"].median())
df_NY["bedrooms"] = df_NY["bedrooms"].clip(lower=0)
In [104]:
# Check
cols_report = ["name","host_since","host_verifications","host_identity_verified","bathrooms","bedrooms","license"]
present = [c for c in cols_report if c in df_NY.columns]
print("Bereinigte Spalten:", present)
print(df_NY[present].isna().sum().sort_values(ascending=False))
Bereinigte Spalten: ['name', 'host_since', 'host_verifications', 'host_identity_verified', 'bathrooms', 'bedrooms', 'license']
name                      0
host_since                0
host_verifications        0
host_identity_verified    0
bathrooms                 0
bedrooms                  0
license                   0
dtype: int64
In [105]:
missing = missing_values(df_NY)
missing[missing["Total_perc"] > 0].Total_perc
Out[105]:
Series([], Name: Total_perc, dtype: float64)

Feature Engineering¶

Host since year erstellen¶

Die Vorgehensweise ist hier wie folgt: Spalte "host_since_year" erstellen und das Jahr aus "host_since" extrahieren.

Ziel: Jahr für einfache Gruppierungen/Trends ableiten.
Voraussetzung: "host_since" ist bereits datetime64.

In [106]:
df_NY["host_since"].head()
Out[106]:
0   2008-09-07
1   2008-09-09
2   2009-02-03
3   2009-05-06
4   2009-05-07
Name: host_since, dtype: datetime64[ns]
In [107]:
# extract year from "host_since"
df_NY["host_since_year"] = df_NY["host_since"].dt.year.astype("int64")
In [108]:
df_NY["host_since_year"].head()
Out[108]:
0    2008
1    2008
2    2009
3    2009
4    2009
Name: host_since_year, dtype: int64
In [109]:
# Erstellen der Sicherung (save dataframe to csv file)
df_NY.to_csv(f"{file_NY[:-4]}_clean.csv")

Data Understanding und Data Preparation - Rome¶

Aufbereitung der Daten: Detailed Listings für Rome, Lazio, Italy (2025)¶

Erster Blick in die Daten¶

Zu Beginn wird ein Ersteindruck der Daten gewonnen:

  • Wie sieht der Datensatz gesamt aus, welche Werte stehen oben, welche unten, was für Datentypen sind vorherrschend?
  • Ebenso werden die einzelnen Spalten etwas genauer angesehen, aufgeteilt u.a. in numerische und nicht-numerische Spalten.
  • Anschließend werden die numerischen Spalten genauer unter die Lupe genommen: Wie viele sind vorhanden, wie sehen das arithmetische Mittel, die Standardabweichung, die Quantile sowie die Max-Werte aus?
In [111]:
# Rome: Datensatz einlesen + erster Blick in die Daten
file_Rome = "data/Rome/listings.csv" 
data_Rome = pd.read_csv(file_Rome, low_memory=False)

# Head von data ansehen
data_Rome.head(2)
Out[111]:
id listing_url scrape_id last_scraped source name description neighborhood_overview picture_url host_id ... review_scores_communication review_scores_location review_scores_value license instant_bookable calculated_host_listings_count calculated_host_listings_count_entire_homes calculated_host_listings_count_private_rooms calculated_host_listings_count_shared_rooms reviews_per_month
0 2737 https://www.airbnb.com/rooms/2737 20250612050804 2025-06-26 city scrape Elif's room in cozy, clean flat. 10 min by bus you can get to Piazza Venezia or... It used to be an industrial area until late 80... https://a0.muscache.com/pictures/41225252/e955... 3047 ... 5.00 4.40 4.40 NaN f 6 0 6 0 0.04
1 3079 https://www.airbnb.com/rooms/3079 20250612050804 2025-06-27 city scrape Cozy apartment (2-4)with Colisseum view With the view of the Colisseum from the front ... Monti neighborhood is one of the best areas in... https://a0.muscache.com/pictures/miso/Hosting-... 3504 ... 4.86 4.81 4.43 NaN f 6 6 0 0 0.12

2 rows × 79 columns

In [112]:
# Rome: Genauere Infos zu Spalten, Datentypen
data_Rome.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36309 entries, 0 to 36308
Data columns (total 79 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   id                                            36309 non-null  int64  
 1   listing_url                                   36309 non-null  object 
 2   scrape_id                                     36309 non-null  int64  
 3   last_scraped                                  36309 non-null  object 
 4   source                                        36309 non-null  object 
 5   name                                          36309 non-null  object 
 6   description                                   35330 non-null  object 
 7   neighborhood_overview                         17656 non-null  object 
 8   picture_url                                   36309 non-null  object 
 9   host_id                                       36309 non-null  int64  
 10  host_url                                      36309 non-null  object 
 11  host_name                                     36293 non-null  object 
 12  host_since                                    36288 non-null  object 
 13  host_location                                 26981 non-null  object 
 14  host_about                                    16758 non-null  object 
 15  host_response_time                            30539 non-null  object 
 16  host_response_rate                            30539 non-null  object 
 17  host_acceptance_rate                          32737 non-null  object 
 18  host_is_superhost                             34304 non-null  object 
 19  host_thumbnail_url                            36288 non-null  object 
 20  host_picture_url                              36288 non-null  object 
 21  host_neighbourhood                            12168 non-null  object 
 22  host_listings_count                           36288 non-null  float64
 23  host_total_listings_count                     36288 non-null  float64
 24  host_verifications                            36288 non-null  object 
 25  host_has_profile_pic                          36288 non-null  object 
 26  host_identity_verified                        36288 non-null  object 
 27  neighbourhood                                 17656 non-null  object 
 28  neighbourhood_cleansed                        36309 non-null  object 
 29  neighbourhood_group_cleansed                  0 non-null      float64
 30  latitude                                      36309 non-null  float64
 31  longitude                                     36309 non-null  float64
 32  property_type                                 36309 non-null  object 
 33  room_type                                     36309 non-null  object 
 34  accommodates                                  36309 non-null  int64  
 35  bathrooms                                     32601 non-null  float64
 36  bathrooms_text                                36275 non-null  object 
 37  bedrooms                                      35532 non-null  float64
 38  beds                                          32583 non-null  float64
 39  amenities                                     36309 non-null  object 
 40  price                                         32591 non-null  object 
 41  minimum_nights                                36309 non-null  int64  
 42  maximum_nights                                36309 non-null  int64  
 43  minimum_minimum_nights                        36299 non-null  float64
 44  maximum_minimum_nights                        36299 non-null  float64
 45  minimum_maximum_nights                        36299 non-null  float64
 46  maximum_maximum_nights                        36299 non-null  float64
 47  minimum_nights_avg_ntm                        36309 non-null  float64
 48  maximum_nights_avg_ntm                        36309 non-null  float64
 49  calendar_updated                              0 non-null      float64
 50  has_availability                              35718 non-null  object 
 51  availability_30                               36309 non-null  int64  
 52  availability_60                               36309 non-null  int64  
 53  availability_90                               36309 non-null  int64  
 54  availability_365                              36309 non-null  int64  
 55  calendar_last_scraped                         36309 non-null  object 
 56  number_of_reviews                             36309 non-null  int64  
 57  number_of_reviews_ltm                         36309 non-null  int64  
 58  number_of_reviews_l30d                        36309 non-null  int64  
 59  availability_eoy                              36309 non-null  int64  
 60  number_of_reviews_ly                          36309 non-null  int64  
 61  estimated_occupancy_l365d                     36309 non-null  int64  
 62  estimated_revenue_l365d                       32591 non-null  float64
 63  first_review                                  31523 non-null  object 
 64  last_review                                   31523 non-null  object 
 65  review_scores_rating                          31523 non-null  float64
 66  review_scores_accuracy                        31516 non-null  float64
 67  review_scores_cleanliness                     31517 non-null  float64
 68  review_scores_checkin                         31516 non-null  float64
 69  review_scores_communication                   31517 non-null  float64
 70  review_scores_location                        31515 non-null  float64
 71  review_scores_value                           31516 non-null  float64
 72  license                                       33124 non-null  object 
 73  instant_bookable                              36309 non-null  object 
 74  calculated_host_listings_count                36309 non-null  int64  
 75  calculated_host_listings_count_entire_homes   36309 non-null  int64  
 76  calculated_host_listings_count_private_rooms  36309 non-null  int64  
 77  calculated_host_listings_count_shared_rooms   36309 non-null  int64  
 78  reviews_per_month                             31523 non-null  float64
dtypes: float64(24), int64(20), object(35)
memory usage: 21.9+ MB
In [113]:
# tail von data ansehen
data_Rome.tail(2)
Out[113]:
id listing_url scrape_id last_scraped source name description neighborhood_overview picture_url host_id ... review_scores_communication review_scores_location review_scores_value license instant_bookable calculated_host_listings_count calculated_host_listings_count_entire_homes calculated_host_listings_count_private_rooms calculated_host_listings_count_shared_rooms reviews_per_month
36307 1439417789689826311 https://www.airbnb.com/rooms/1439417789689826311 20250612050804 2025-06-25 city scrape New Studio B - Best Deal x 1 person Studio, cozy, renovated, perfect for a person ... NaN https://a0.muscache.com/pictures/hosting/Hosti... 34005675 ... NaN NaN NaN NaN f 12 8 4 0 NaN
36308 1439545861830252500 https://www.airbnb.com/rooms/1439545861830252500 20250612050804 2025-06-26 city scrape Via dell Orso - Rome Discover your perfect urban escape in the hear... Located in the heart of Rome, this neighborhoo... https://a0.muscache.com/pictures/prohost-api/H... 597677415 ... NaN NaN NaN IT058091C22E6TFIXA t 14 14 0 0 NaN

2 rows × 79 columns

In [114]:
# Genereller shape des Datensatzes
data_Rome.shape
Out[114]:
(36309, 79)
In [115]:
# Alle Columns anzeigen lassen
data_Rome.columns
Out[115]:
Index(['id', 'listing_url', 'scrape_id', 'last_scraped', 'source', 'name',
       'description', 'neighborhood_overview', 'picture_url', 'host_id',
       'host_url', 'host_name', 'host_since', 'host_location', 'host_about',
       'host_response_time', 'host_response_rate', 'host_acceptance_rate',
       'host_is_superhost', 'host_thumbnail_url', 'host_picture_url',
       'host_neighbourhood', 'host_listings_count',
       'host_total_listings_count', 'host_verifications',
       'host_has_profile_pic', 'host_identity_verified', 'neighbourhood',
       'neighbourhood_cleansed', 'neighbourhood_group_cleansed', 'latitude',
       'longitude', 'property_type', 'room_type', 'accommodates', 'bathrooms',
       'bathrooms_text', 'bedrooms', 'beds', 'amenities', 'price',
       'minimum_nights', 'maximum_nights', 'minimum_minimum_nights',
       'maximum_minimum_nights', 'minimum_maximum_nights',
       'maximum_maximum_nights', 'minimum_nights_avg_ntm',
       'maximum_nights_avg_ntm', 'calendar_updated', 'has_availability',
       'availability_30', 'availability_60', 'availability_90',
       'availability_365', 'calendar_last_scraped', 'number_of_reviews',
       'number_of_reviews_ltm', 'number_of_reviews_l30d', 'availability_eoy',
       'number_of_reviews_ly', 'estimated_occupancy_l365d',
       'estimated_revenue_l365d', 'first_review', 'last_review',
       'review_scores_rating', 'review_scores_accuracy',
       'review_scores_cleanliness', 'review_scores_checkin',
       'review_scores_communication', 'review_scores_location',
       'review_scores_value', 'license', 'instant_bookable',
       'calculated_host_listings_count',
       'calculated_host_listings_count_entire_homes',
       'calculated_host_listings_count_private_rooms',
       'calculated_host_listings_count_shared_rooms', 'reviews_per_month'],
      dtype='object')
In [116]:
# Rome: Aufspaltung in numerische und nicht-numerische Spalten

# Numerische Spalten
data_numeric = data_Rome.select_dtypes(include=[np.number])
numeric_cols = data_numeric.columns.values
print("\n", "1 - Numerische Spalten: ", "\n")
print(numeric_cols)

# Nicht-numerische Spalten
data_non_numeric = data_Rome.select_dtypes(exclude=[np.number])
non_numeric_cols = data_non_numeric.columns.values
print("\n", "2 - Nicht-numerische Spalten: ", "\n")
print(non_numeric_cols)
 1 - Numerische Spalten:  

['id' 'scrape_id' 'host_id' 'host_listings_count'
 'host_total_listings_count' 'neighbourhood_group_cleansed' 'latitude'
 'longitude' 'accommodates' 'bathrooms' 'bedrooms' 'beds' 'minimum_nights'
 'maximum_nights' 'minimum_minimum_nights' 'maximum_minimum_nights'
 'minimum_maximum_nights' 'maximum_maximum_nights'
 'minimum_nights_avg_ntm' 'maximum_nights_avg_ntm' 'calendar_updated'
 'availability_30' 'availability_60' 'availability_90' 'availability_365'
 'number_of_reviews' 'number_of_reviews_ltm' 'number_of_reviews_l30d'
 'availability_eoy' 'number_of_reviews_ly' 'estimated_occupancy_l365d'
 'estimated_revenue_l365d' 'review_scores_rating' 'review_scores_accuracy'
 'review_scores_cleanliness' 'review_scores_checkin'
 'review_scores_communication' 'review_scores_location'
 'review_scores_value' 'calculated_host_listings_count'
 'calculated_host_listings_count_entire_homes'
 'calculated_host_listings_count_private_rooms'
 'calculated_host_listings_count_shared_rooms' 'reviews_per_month']

 2 - Nicht-numerische Spalten:  

['listing_url' 'last_scraped' 'source' 'name' 'description'
 'neighborhood_overview' 'picture_url' 'host_url' 'host_name' 'host_since'
 'host_location' 'host_about' 'host_response_time' 'host_response_rate'
 'host_acceptance_rate' 'host_is_superhost' 'host_thumbnail_url'
 'host_picture_url' 'host_neighbourhood' 'host_verifications'
 'host_has_profile_pic' 'host_identity_verified' 'neighbourhood'
 'neighbourhood_cleansed' 'property_type' 'room_type' 'bathrooms_text'
 'amenities' 'price' 'has_availability' 'calendar_last_scraped'
 'first_review' 'last_review' 'license' 'instant_bookable']
In [117]:
# Anzahl Numerische Spalten
data_Rome.select_dtypes(include="number").shape[1]
Out[117]:
44
In [118]:
# Anzahl Kategoriale Spalten
data_Rome.select_dtypes(include=["object"]).shape[1]
Out[118]:
35
In [119]:
# Rome: Numerische Werte genauer ansehen
data_Rome.describe().T
Out[119]:
count mean std min 25% 50% 75% max
id 36309.0 7.091096e+17 5.642698e+17 2.737000e+03 3.641325e+07 9.080119e+17 1.221495e+18 1.440786e+18
scrape_id 36309.0 2.025061e+13 0.000000e+00 2.025061e+13 2.025061e+13 2.025061e+13 2.025061e+13 2.025061e+13
host_id 36309.0 2.532322e+08 2.383981e+08 1.822000e+03 2.845217e+07 1.627744e+08 4.899548e+08 7.004666e+08
host_listings_count 36288.0 1.639385e+01 7.222286e+01 1.000000e+00 1.000000e+00 3.000000e+00 7.000000e+00 1.845000e+03
host_total_listings_count 36288.0 2.466755e+01 1.352076e+02 1.000000e+00 1.000000e+00 3.000000e+00 9.000000e+00 8.724000e+03
neighbourhood_group_cleansed 0.0 NaN NaN NaN NaN NaN NaN NaN
latitude 36309.0 4.189135e+01 3.522517e-02 4.165582e+01 4.188332e+01 4.189634e+01 4.190670e+01 4.212131e+01
longitude 36309.0 1.248055e+01 4.957734e-02 1.224253e+01 1.245916e+01 1.247763e+01 1.250551e+01 1.283570e+01
accommodates 36309.0 3.820155e+00 1.987939e+00 1.000000e+00 2.000000e+00 4.000000e+00 5.000000e+00 1.600000e+01
bathrooms 32601.0 1.309193e+00 6.803291e-01 0.000000e+00 1.000000e+00 1.000000e+00 1.500000e+00 2.150000e+01
bedrooms 35532.0 1.509259e+00 8.989650e-01 0.000000e+00 1.000000e+00 1.000000e+00 2.000000e+00 2.800000e+01
beds 32583.0 2.223706e+00 1.494002e+00 0.000000e+00 1.000000e+00 2.000000e+00 3.000000e+00 4.000000e+01
minimum_nights 36309.0 5.544851e+00 1.619246e+01 1.000000e+00 1.000000e+00 2.000000e+00 3.000000e+00 7.300000e+02
maximum_nights 36309.0 3.531390e+02 4.084050e+02 1.000000e+00 2.900000e+01 3.650000e+02 3.650000e+02 1.825000e+03
minimum_minimum_nights 36299.0 4.988347e+00 1.507299e+01 1.000000e+00 1.000000e+00 2.000000e+00 2.000000e+00 7.300000e+02
maximum_minimum_nights 36299.0 6.625417e+00 2.818776e+01 1.000000e+00 2.000000e+00 2.000000e+00 3.000000e+00 1.000000e+03
minimum_maximum_nights 36299.0 5.050061e+02 4.895995e+02 1.000000e+00 3.000000e+01 3.650000e+02 1.125000e+03 9.999000e+03
maximum_maximum_nights 36299.0 5.549658e+02 4.950607e+02 1.000000e+00 3.000000e+01 3.650000e+02 1.125000e+03 9.999000e+03
minimum_nights_avg_ntm 36309.0 5.518841e+00 1.701510e+01 1.000000e+00 1.300000e+00 2.000000e+00 3.000000e+00 9.194000e+02
maximum_nights_avg_ntm 36309.0 5.292944e+02 4.823069e+02 1.000000e+00 3.000000e+01 3.650000e+02 1.125000e+03 9.999000e+03
calendar_updated 0.0 NaN NaN NaN NaN NaN NaN NaN
availability_30 36309.0 1.319563e+01 9.657643e+00 0.000000e+00 5.000000e+00 1.200000e+01 2.100000e+01 3.000000e+01
availability_60 36309.0 3.075667e+01 1.886661e+01 0.000000e+00 1.500000e+01 3.100000e+01 4.700000e+01 6.000000e+01
availability_90 36309.0 4.971379e+01 2.756184e+01 0.000000e+00 2.900000e+01 5.200000e+01 7.300000e+01 9.000000e+01
availability_365 36309.0 2.138702e+02 1.129729e+02 0.000000e+00 1.240000e+02 2.350000e+02 3.160000e+02 3.650000e+02
number_of_reviews 36309.0 5.744204e+01 9.728829e+01 0.000000e+00 3.000000e+00 1.700000e+01 6.700000e+01 2.683000e+03
number_of_reviews_ltm 36309.0 1.436966e+01 1.888631e+01 0.000000e+00 1.000000e+00 6.000000e+00 2.200000e+01 5.090000e+02
number_of_reviews_l30d 36309.0 1.484370e+00 2.063618e+00 0.000000e+00 0.000000e+00 0.000000e+00 2.000000e+00 2.500000e+01
availability_eoy 36309.0 1.166778e+02 5.555669e+01 0.000000e+00 8.300000e+01 1.290000e+02 1.620000e+02 1.900000e+02
number_of_reviews_ly 36309.0 1.328861e+01 1.950826e+01 0.000000e+00 0.000000e+00 3.000000e+00 2.100000e+01 5.210000e+02
estimated_occupancy_l365d 36309.0 8.180738e+01 9.086244e+01 0.000000e+00 6.000000e+00 4.200000e+01 1.440000e+02 2.550000e+02
estimated_revenue_l365d 32591.0 1.494125e+04 2.859316e+04 0.000000e+00 8.760000e+02 6.216000e+03 1.989000e+04 1.242870e+06
review_scores_rating 31523.0 4.769352e+00 3.278882e-01 1.000000e+00 4.690000e+00 4.860000e+00 4.980000e+00 5.000000e+00
review_scores_accuracy 31516.0 4.805095e+00 3.129784e-01 1.000000e+00 4.750000e+00 4.890000e+00 5.000000e+00 5.000000e+00
review_scores_cleanliness 31517.0 4.784912e+00 3.223713e-01 1.000000e+00 4.710000e+00 4.880000e+00 5.000000e+00 5.000000e+00
review_scores_checkin 31516.0 4.845727e+00 2.879957e-01 1.000000e+00 4.810000e+00 4.920000e+00 5.000000e+00 5.000000e+00
review_scores_communication 31517.0 4.858408e+00 2.946084e-01 1.000000e+00 4.840000e+00 4.950000e+00 5.000000e+00 5.000000e+00
review_scores_location 31515.0 4.754153e+00 3.169800e-01 1.000000e+00 4.650000e+00 4.840000e+00 4.970000e+00 5.000000e+00
review_scores_value 31516.0 4.684645e+00 3.617441e-01 1.000000e+00 4.600000e+00 4.760000e+00 4.890000e+00 5.000000e+00
calculated_host_listings_count 36309.0 9.320857e+00 2.734508e+01 1.000000e+00 1.000000e+00 2.000000e+00 6.000000e+00 2.440000e+02
calculated_host_listings_count_entire_homes 36309.0 7.774216e+00 2.703961e+01 0.000000e+00 1.000000e+00 1.000000e+00 3.000000e+00 2.440000e+02
calculated_host_listings_count_private_rooms 36309.0 1.455892e+00 3.547885e+00 0.000000e+00 0.000000e+00 0.000000e+00 2.000000e+00 4.300000e+01
calculated_host_listings_count_shared_rooms 36309.0 2.277672e-02 5.159183e-01 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 1.800000e+01
reviews_per_month 31523.0 1.806902e+00 1.642524e+00 1.000000e-02 4.800000e-01 1.340000e+00 2.750000e+00 3.973000e+01
In [120]:
# Rome: Describe der Spalte "bathrooms"
data_Rome["bathrooms"].describe()
Out[120]:
count    32601.000000
mean         1.309193
std          0.680329
min          0.000000
25%          1.000000
50%          1.000000
75%          1.500000
max         21.500000
Name: bathrooms, dtype: float64
In [121]:
# Rome: Describe der Spalte "availability_365"
data_Rome["availability_365"].describe()
Out[121]:
count    36309.000000
mean       213.870197
std        112.972871
min          0.000000
25%        124.000000
50%        235.000000
75%        316.000000
max        365.000000
Name: availability_365, dtype: float64
In [122]:
# Rome: Describe der Spalte "host_since"
data_Rome["host_since"].describe()
Out[122]:
count          36288
unique          4852
top       2014-11-09
freq             253
Name: host_since, dtype: object
In [123]:
# Datums-Typen bearbeiten: object -> datetime
if "host_since" in data_Rome.columns:
    data_Rome["host_since"] = pd.to_datetime(data_Rome["host_since"], errors="coerce")

data_Rome["host_since"].describe()
Out[123]:
count                            36288
mean     2018-08-24 00:31:01.904761856
min                2008-08-12 00:00:00
25%                2015-03-03 00:00:00
50%                2017-12-17 00:00:00
75%                2022-12-01 06:00:00
max                2025-06-10 00:00:00
Name: host_since, dtype: object
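Das Verhalten von errors="coerce" lässt sich an einem kleinen, hypothetischen Beispiel zeigen (Werte stammen nicht aus dem Datensatz): nicht parsbare Einträge werden zu NaT, statt eine Exception auszulösen.

```python
import pandas as pd

# Hypothetische Beispielwerte: ein gültiges Datum, ein ungültiger String, None
s = pd.Series(["2008-08-12", "kein Datum", None])
out = pd.to_datetime(s, errors="coerce")
print(out)  # ungültiger String und None werden zu NaT
```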

Zweiter Blick - Beispiele: Verfügbarkeit, Ausstattung, Host since¶

In [125]:
# Blick in die Daten
# Ausstattung Rom ("amenities") an einer bestimmten Position (loc) lesen
data_Rome["amenities"].loc[3010]
Out[125]:
'["Fire extinguisher", "Kitchen", "Wifi", "Washer", "First aid kit", "Essentials", "Heating", "TV"]'
In [126]:
# Blick in die Daten
# Vollständige Zeile (alle Spalten) zum obigen amenities-Beispiel
data_Rome.loc[3010]
Out[126]:
id                                                                           8677467
listing_url                                     https://www.airbnb.com/rooms/8677467
scrape_id                                                             20250612050804
last_scraped                                                              2025-07-02
source                                                               previous scrape
                                                                ...                 
calculated_host_listings_count                                                     1
calculated_host_listings_count_entire_homes                                        1
calculated_host_listings_count_private_rooms                                       0
calculated_host_listings_count_shared_rooms                                        0
reviews_per_month                                                                NaN
Name: 3010, Length: 79, dtype: object
In [127]:
# Blick in die Daten
# "host_since" nach der Konvertierung zu datetime64 (siehe oben)
data_Rome["host_since"].head(1)
Out[127]:
0   2008-09-18
Name: host_since, dtype: datetime64[ns]
In [128]:
# Blick in die Daten
if "host_since" in data_Rome.columns:
    data_Rome["host_since"] = pd.to_datetime(data_Rome["host_since"], errors="coerce")
data_Rome["host_since"].head(10)
Out[128]:
0   2008-09-18
1   2008-10-08
2   2009-10-09
3   2009-03-30
4   2010-01-30
5   2010-01-30
6   2010-02-04
7   2010-02-04
8   2011-07-03
9   2011-07-05
Name: host_since, dtype: datetime64[ns]
In [129]:
# Blick in die Daten
# Hosts seit "2016-05-20"
data_Rome.loc[data_Rome["host_since"]=="2016-05-20"][["name", "host_since"]]
Out[129]:
name host_since
4100 Little Queen Guest House 2016-05-20
6203 Olimpico appartament 2016-05-20
7154 Appartamento A Roma Aquila 2 2016-05-20
7458 Casa Vacanze Aquila 3 2016-05-20
7953 Blancmaisonteveresuite 2016-05-20
11431 Appartamento fronte mare a Roma. 2016-05-20
13827 Appartamento Largo Arenula 2016-05-20
15173 Studio Appia 2016-05-20
17873 Blancmaisonvaticano 2016-05-20
22787 Rooms Sapienza 2016-05-20
32576 Cozy apartment Rome Eur 2016-05-20
In [130]:
# Blick in die Daten
# Top 5 der höchsten Verfügbarkeit (availability_365)
data_5first_availability_365 = data_Rome.sort_values(by=["availability_365"], ascending=False).head(5)
data_5first_availability_365[["name", "availability_365", "host_since", "neighbourhood"]]
Out[130]:
name availability_365 host_since neighbourhood
0 Elif's room in cozy, clean flat. 365 2008-09-18 Rome, Lazio, Italy
20491 suite 1 365 2015-07-31 Rome, Lazio, Italy
19011 Camera Singola in Villa con Piscina 365 2023-07-27 NaN
19060 A.Livia Navona 365 2017-11-07 NaN
19111 Casa elegante centro Roma 365 2016-04-07 Rome, Lazio, Italy
In [131]:
# Blick in die Daten
# Top 5 der niedrigsten Verfügbarkeit (availability_365)
data_5last_availability_365 = data_Rome.sort_values(by=["availability_365"], ascending=False).tail(5)
data_5last_availability_365[["name", "availability_365", "host_since", "neighbourhood"]]
Out[131]:
name availability_365 host_since neighbourhood
9311 Best price 0 2019-09-01 NaN
9310 best price 0 2019-09-01 NaN
26331 Comfort vicino al Vaticano, ROOM 404 0 2024-07-06 NaN
26327 Trastevere Apartment 23 0 2024-07-08 NaN
2054 Corso 12 Luxury Rooms & Suites - Piazza Di Spagna 0 2015-01-20 Rome, Lazio, Italy
In [132]:
# Blick in die Daten
# Top 5 "availability_365" mit style.bar

data_5first_availability_365[["name", "availability_365", "host_since", "neighbourhood"]].style.bar(subset=["availability_365"])
Out[132]:
  name availability_365 host_since neighbourhood
0 Elif's room in cozy, clean flat. 365 2008-09-18 00:00:00 Rome, Lazio, Italy
20491 suite 1 365 2015-07-31 00:00:00 Rome, Lazio, Italy
19011 Camera Singola in Villa con Piscina 365 2023-07-27 00:00:00 nan
19060 A.Livia Navona 365 2017-11-07 00:00:00 nan
19111 Casa elegante centro Roma 365 2016-04-07 00:00:00 Rome, Lazio, Italy
In [133]:
# Blick in die Daten

# Check Summe von number_of_reviews
print("number_of_reviews: ", data_Rome["number_of_reviews"].sum())

# Check Summe von  property_type
print("property_type: ", len(data_Rome["property_type"].unique()))

# Check Summe von neighbourhood_group_cleansed
print("neighbourhood_group_cleansed: ", len(data_Rome["neighbourhood_group_cleansed"].unique()))
number_of_reviews:  2085663
property_type:  64
neighbourhood_group_cleansed:  1
In [134]:
# Investigate Room Type
data_Rome["room_type"].unique()
Out[134]:
array(['Private room', 'Entire home/apt', 'Hotel room', 'Shared room'],
      dtype=object)

Deskriptive Statistik:¶

In diesem Teil werden wir uns den folgenden Kernthemen zuwenden:

  • Spalten mit reinen URLs, Bildern, Host-Name, Host-ID, Host-Location und Beschreibung droppen
  • Konvertierung der "Date"-Spalten und der booleschen Spalten
  • Behandlung von Null-Werten
In [135]:
# Spalten mit reinen URLs, Bildern, Host-Name, Host-ID, Host-Location und Beschreibung droppen

# Ebenfalls entfernt: von Airbnb berechnete Zählspalten
# calculated_host_listings_count
# calculated_host_listings_count_entire_homes
# calculated_host_listings_count_private_rooms
# calculated_host_listings_count_shared_rooms

# Spalte "host_name" - Name of the host. Usually just the first name(s).

# Spalten "host_listings_count", "host_total_listings_count": The number of listings the host has (per Airbnb unknown calculations)

# Diese Spalten werden für die Analyse nicht gebraucht:
drop_columns = ["listing_url", 
                "scrape_id", 
                "last_scraped", 
                "source",
                "description",
                "neighborhood_overview", 
                "picture_url",
                "host_url",
                "host_name",
                "host_about",
                "host_thumbnail_url",
                "host_picture_url",
                "host_listings_count",
                "host_total_listings_count",
                "host_has_profile_pic",
                "calendar_last_scraped",
                "calculated_host_listings_count",
                "calculated_host_listings_count_entire_homes",
                "calculated_host_listings_count_private_rooms",
                "calculated_host_listings_count_shared_rooms"
                ]

df_Rome = data_Rome.drop(columns=drop_columns, axis=1, errors="ignore")
In [136]:
print(df_Rome[["host_identity_verified","instant_bookable","has_availability"]].head())

# Bool-Typen bearbeiten: "t"/"f" (bzw. echte bool-Werte) -> True/False, sonst <NA>
def to_bool(x):
    if pd.isna(x):
        return pd.NA
    if isinstance(x, bool):
        return x
    s = str(x).strip().lower()
    if s == "t":
        return True
    if s == "f":
        return False
    return pd.NA

col_bool = ["host_identity_verified", "instant_bookable", "has_availability"]

for c in col_bool:
    if c in df_Rome.columns:
        df_Rome[c] = df_Rome[c].map(to_bool).astype("boolean")
  host_identity_verified instant_bookable has_availability
0                      t                f                t
1                      t                f                t
2                      t                f                t
3                      t                f                t
4                      t                t                t
In [137]:
print(df_Rome[["host_identity_verified","instant_bookable","has_availability"]].head())
   host_identity_verified  instant_bookable  has_availability
0                    True             False              True
1                    True             False              True
2                    True             False              True
3                    True             False              True
4                    True              True              True
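Alternativ zur eigenen to_bool-Funktion lässt sich die t/f-Kodierung auch direkt per map abbilden. Skizze mit einer hypothetischen Beispiel-Series (nicht dem Originaldatensatz); unbekannte Werte und None werden dabei automatisch zu <NA>.

```python
import pandas as pd

# Hypothetische Beispielwerte im t/f-Format
s = pd.Series(["t", "f", None, "t"])
out = s.map({"t": True, "f": False}).astype("boolean")
print(out.tolist())
```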
In [138]:
# Spalte "license" - Werte [nan, 'Exempt', "IT*"]
# Spalte "license" nicht droppen
[i for i in df_Rome["license"].unique() if str(i)[:2].upper() != "IT"]
Out[138]:
[nan, 'I applied for a CIN, but haven’t received it yet', 'Exempt']
In [139]:
# Check for the missing values
df_Rome.isna().any()
Out[139]:
id                              False
name                            False
host_id                         False
host_since                       True
host_location                    True
host_response_time               True
host_response_rate               True
host_acceptance_rate             True
host_is_superhost                True
host_neighbourhood               True
host_verifications               True
host_identity_verified           True
neighbourhood                    True
neighbourhood_cleansed          False
neighbourhood_group_cleansed     True
latitude                        False
longitude                       False
property_type                   False
room_type                       False
accommodates                    False
bathrooms                        True
bathrooms_text                   True
bedrooms                         True
beds                             True
amenities                       False
price                            True
minimum_nights                  False
maximum_nights                  False
minimum_minimum_nights           True
maximum_minimum_nights           True
minimum_maximum_nights           True
maximum_maximum_nights           True
minimum_nights_avg_ntm          False
maximum_nights_avg_ntm          False
calendar_updated                 True
has_availability                 True
availability_30                 False
availability_60                 False
availability_90                 False
availability_365                False
number_of_reviews               False
number_of_reviews_ltm           False
number_of_reviews_l30d          False
availability_eoy                False
number_of_reviews_ly            False
estimated_occupancy_l365d       False
estimated_revenue_l365d          True
first_review                     True
last_review                      True
review_scores_rating             True
review_scores_accuracy           True
review_scores_cleanliness        True
review_scores_checkin            True
review_scores_communication      True
review_scores_location           True
review_scores_value              True
license                          True
instant_bookable                False
reviews_per_month                True
dtype: bool
In [140]:
# Zusätzliche Funktionen

# Missing Values: Anzahl und Anteil fehlender Werte je Spalte
def missing_values(df):
    missing = df.isnull().sum().to_frame(name="Missing_values")
    missing["Total_perc"] = (missing["Missing_values"] / len(df)) * 100
    return missing
In [141]:
missing = missing_values(df_Rome)
missing[missing["Total_perc"] > 0].Total_perc
Out[141]:
host_since                        0.057837
host_location                    25.690600
host_response_time               15.891377
host_response_rate               15.891377
host_acceptance_rate              9.837781
host_is_superhost                 5.522047
host_neighbourhood               66.487648
host_verifications                0.057837
host_identity_verified            0.057837
neighbourhood                    51.372938
neighbourhood_group_cleansed    100.000000
bathrooms                        10.212344
bathrooms_text                    0.093641
bedrooms                          2.139965
beds                             10.261919
price                            10.239885
minimum_minimum_nights            0.027541
maximum_minimum_nights            0.027541
minimum_maximum_nights            0.027541
maximum_maximum_nights            0.027541
calendar_updated                100.000000
has_availability                  1.627696
estimated_revenue_l365d          10.239885
first_review                     13.181305
last_review                      13.181305
review_scores_rating             13.181305
review_scores_accuracy           13.200584
review_scores_cleanliness        13.197830
review_scores_checkin            13.200584
review_scores_communication      13.197830
review_scores_location           13.203338
review_scores_value              13.200584
license                           8.771930
reviews_per_month                13.181305
Name: Total_perc, dtype: float64
In [142]:
# Spalte "calendar_updated": Missing-Rate 100%
df_Rome["calendar_updated"].unique()

# Spalte "neighbourhood_group_cleansed": Missing-Rate 100%
df_Rome["neighbourhood_group_cleansed"].unique()

# Weitere Drop-Kandidaten (hohe Missing-Rate bzw. keine Beschreibung im Data Dictionary):
# host_is_superhost
# host_response_time
# host_response_rate
Out[142]:
array([nan])
In [144]:
df_Rome["neighbourhood_group_cleansed"] = df_Rome["neighbourhood"]
In [145]:
drop_columns = ["host_response_time",
                "host_response_rate",
                "host_is_superhost",
                "neighbourhood",
                "calendar_updated",
                ]

df_Rome = df_Rome.drop(columns=drop_columns, axis=1, errors="ignore")
# print(df_Rome.info())
In [146]:
# Median
df_Rome.median(numeric_only=True)
Out[146]:
id                             908011910692434944.0
host_id                                 162774411.0
host_identity_verified                          1.0
latitude                                  41.896342
longitude                                  12.47763
accommodates                                    4.0
bathrooms                                       1.0
bedrooms                                        1.0
beds                                            2.0
minimum_nights                                  2.0
maximum_nights                                365.0
minimum_minimum_nights                          2.0
maximum_minimum_nights                          2.0
minimum_maximum_nights                        365.0
maximum_maximum_nights                        365.0
minimum_nights_avg_ntm                          2.0
maximum_nights_avg_ntm                        365.0
has_availability                                1.0
availability_30                                12.0
availability_60                                31.0
availability_90                                52.0
availability_365                              235.0
number_of_reviews                              17.0
number_of_reviews_ltm                           6.0
number_of_reviews_l30d                          0.0
availability_eoy                              129.0
number_of_reviews_ly                            3.0
estimated_occupancy_l365d                      42.0
estimated_revenue_l365d                      6216.0
review_scores_rating                           4.86
review_scores_accuracy                         4.89
review_scores_cleanliness                      4.88
review_scores_checkin                          4.92
review_scores_communication                    4.95
review_scores_location                         4.84
review_scores_value                            4.76
instant_bookable                                1.0
reviews_per_month                              1.34
dtype: Float64
In [147]:
# Mean of "accommodates"
df_Rome["accommodates"].mean().round(2)
Out[147]:
np.float64(3.82)
In [148]:
# Mean of "minimum_nights" and "maximum_nights"
df_Rome["minimum_nights"].mean().round(2), df_Rome["maximum_nights"].mean().round(2) 
Out[148]:
(np.float64(5.54), np.float64(353.14))

IQR (Interquartile Range)¶

num = df_Rome.select_dtypes(include=[np.floating, np.integer])
num.quantile(0.75) - num.quantile(0.25)

In [150]:
df_Rome.describe(include=['object', 'category'])
Out[150]:
name host_location host_acceptance_rate host_neighbourhood host_verifications neighbourhood_cleansed neighbourhood_group_cleansed property_type room_type bathrooms_text amenities price first_review last_review license
count 36309 26981 32737 12168 36288 36309 17656 36309 36309 36275 36309 32591 31523 31523 33124
unique 35376 559 99 118 6 15 108 64 4 41 34807 1079 4173 1777 25919
top Vacanze Romane Rome, Italy 100% Prati ['email', 'phone'] I Centro Storico Rome, Lazio, Italy Entire rental unit Entire home/apt 1 bath ["Kitchen", "TV", "Washer", "Wifi", "Air condi... $90.00 2025-01-02 2025-06-22 I applied for a CIN, but haven’t received it yet
freq 16 24087 21186 1402 31342 17889 16951 18132 27345 19693 42 424 142 1192 378

Fehlende Daten - Null-Werte¶

Mithilfe von isnull().sum() erkennen wir, dass fast alle Spalten Null-Werte haben:

In [152]:
df_Rome.isnull().sum()
Out[152]:
id                                  0
name                                0
host_id                             0
host_since                         21
host_location                    9328
host_acceptance_rate             3572
host_neighbourhood              24141
host_verifications                 21
host_identity_verified             21
neighbourhood_cleansed              0
neighbourhood_group_cleansed    18653
latitude                            0
longitude                           0
property_type                       0
room_type                           0
accommodates                        0
bathrooms                        3708
bathrooms_text                     34
bedrooms                          777
beds                             3726
amenities                           0
price                            3718
minimum_nights                      0
maximum_nights                      0
minimum_minimum_nights             10
maximum_minimum_nights             10
minimum_maximum_nights             10
maximum_maximum_nights             10
minimum_nights_avg_ntm              0
maximum_nights_avg_ntm              0
has_availability                  591
availability_30                     0
availability_60                     0
availability_90                     0
availability_365                    0
number_of_reviews                   0
number_of_reviews_ltm               0
number_of_reviews_l30d              0
availability_eoy                    0
number_of_reviews_ly                0
estimated_occupancy_l365d           0
estimated_revenue_l365d          3718
first_review                     4786
last_review                      4786
review_scores_rating             4786
review_scores_accuracy           4793
review_scores_cleanliness        4792
review_scores_checkin            4793
review_scores_communication      4792
review_scores_location           4794
review_scores_value              4793
license                          3185
instant_bookable                    0
reviews_per_month                4786
dtype: int64
In [153]:
# Anzahl unterschiedlicher Zeilen: 36309
# Keine Duplikate
df_Rome.drop_duplicates().shape[0]
Out[153]:
36309

Spalten Price, Bathrooms, has_availability säubern¶

Spalte "price" bereinigen und fehlende Werte ausfüllen¶

Fehlende Werte 10.0%

Ziel: Die Spalte price sauber in Float konvertieren, Ausreißer entfernen und fehlende Werte hierarchisch per Median imputieren.

Vorgehen:

  • Preis säubern: in Float umwandeln; Währungszeichen und Tausendertrennzeichen entfernen.
  • Unplausible Werte: price <= 0 oder sehr hoch (z. B. > 99,5%-Quantil) → NaN.
  • Check vor Imputation: Verteilungen/Kennzahlen vor dem Füllen prüfen.
  • Imputation: hierarchischer Median.
  • Check nach Imputation: Verteilungen/Kennzahlen nach dem Füllen prüfen.
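Die Schritte Ausreißer-Behandlung und hierarchische Median-Imputation lassen sich so skizzieren (Annahmen: Spalten price, neighbourhood_cleansed und room_type wie im Datensatz; der Funktionsname impute_price_hierarchical ist hypothetisch, keine verbindliche Implementierung):

```python
import numpy as np
import pandas as pd

def impute_price_hierarchical(df):
    """Skizze: unplausible Preise -> NaN, dann Median-Imputation von fein nach grob."""
    out = df.copy()
    # Unplausible Werte: <= 0 oder oberhalb des 99,5%-Quantils -> NaN
    upper = out["price"].quantile(0.995)
    out.loc[(out["price"] <= 0) | (out["price"] > upper), "price"] = np.nan
    # Hierarchischer Median: erst je (Nachbarschaft, Zimmertyp), dann je Zimmertyp
    for keys in [["neighbourhood_cleansed", "room_type"], ["room_type"]]:
        med = out.groupby(keys)["price"].transform("median")
        out["price"] = out["price"].fillna(med)
    # Fallback: globaler Median
    out["price"] = out["price"].fillna(out["price"].median())
    return out
```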
In [155]:
print("Missing price before imputation:", df_Rome["price"].isna().sum())
Missing price before imputation: 3718
In [156]:
# Investigate "price" column
df_Rome["price"].loc[20:26]

def find_currency(df):
    currency = []
    for i in df["price"].unique():
        currency.append(str(i)[0])
    currency.remove('n')  # 'n' stammt vom ersten Zeichen von str(nan)
    return set(currency)

find_currency(df_Rome)
Out[156]:
{'$'}
In [157]:
# Preis säubern: object -> float
df_Rome["price"] = (df_Rome["price"].astype(str)
                 .str.replace(r"[^\d.\-]", "", regex=True)
                 .replace({"": np.nan}))
df_Rome["price"] = pd.to_numeric(df_Rome["price"], errors="coerce")
In [158]:
# Fallback: globaler Median 
if df_Rome["price"].isna().any():
    df_Rome["price"] = df_Rome["price"].fillna(df_Rome["price"].median())

# Check
print("Missing price after imputation:", df_Rome["price"].isna().sum())
Missing price after imputation: 0

Spalte "bathrooms" bereinigen und fehlende Werte aus "bathrooms_text" füllen¶

Fehlende Werte 10.0%

Ziel: Die numerische Spalte bathrooms (float) vervollständigen, indem wir fehlende Werte aus der textuellen Spalte bathrooms_text ableiten. Beide Felder beschreiben dieselbe Information.

Vorgehen:

  1. Parsing-Regeln:
    • Zahl direkt vor bath/bathrooms (z. B. 1 bath, 1.5 baths, 2 bathrooms, 0 baths).
    • half-bath ohne Zahl ist gleich 0.5.
    • Zusätze wie private/shared werden ignoriert.
  2. Füllen: Nur dort ergänzen, wo bathrooms NaN ist – existierende numerische Werte werden nicht überschrieben.
  3. Check: Wie viele Zeilen gefüllt, wie viele fehlen noch.
In [160]:
df_Rome[df_Rome["bathrooms"].notna()][["bathrooms", "bathrooms_text"]].head(10)
Out[160]:
bathrooms bathrooms_text
0 1.5 1.5 baths
1 1.0 1 bath
2 1.0 1 bath
3 1.0 1 bath
4 1.0 1 bath
5 1.0 1 bath
6 1.0 1 bath
7 1.0 1 bath
8 2.0 2 shared baths
9 1.0 1 bath
In [161]:
# Investigate
df_Rome["bathrooms"].unique()
Out[161]:
array([ 1.5,  1. ,  2. ,  nan,  4. ,  3. ,  5. ,  6.5,  2.5,  4.5,  0. ,
        5.5,  6. , 12.5,  7. ,  8. ,  3.5,  0.5,  7.5, 12. ,  9.5, 21.5,
       10. ,  9. , 11. ])
In [162]:
# Investigate
df_Rome["bathrooms_text"].unique()
Out[162]:
array(['1.5 baths', '1 bath', '2 shared baths', '1 private bath',
       '2 baths', '4 baths', '3 baths', '5 baths', '6.5 baths',
       '3 shared baths', '2.5 baths', '1 shared bath', '4.5 baths',
       '0 shared baths', 'Half-bath', nan, '12.5 baths', '5.5 baths',
       '1.5 shared baths', '6 baths', '0 baths', '7 baths',
       '8 shared baths', '3.5 baths', 'Shared half-bath',
       '6 shared baths', '2.5 shared baths', '8 baths', '7.5 baths',
       '12 baths', '7 shared baths', '9.5 baths', '21.5 baths',
       '4 shared baths', '10 shared baths', '5 shared baths', '9 baths',
       'Private half-bath', '10 baths', '24 baths', '7.5 shared baths',
       '11 baths'], dtype=object)
In [163]:
# Investigate: "bathrooms", "bathrooms_text"
df_bathrooms_null_bathrooms_text = df_Rome[df_Rome["bathrooms"].isnull()][["bathrooms", "bathrooms_text"]]

df_bathrooms_null_bathrooms_text.head(10)
# df_bathrooms_null_bathrooms_text.index.values

idx = df_bathrooms_null_bathrooms_text.index.values
In [164]:
# fill df_Rome["bathrooms"] from df_Rome["bathrooms_text"] only where it’s missing 
# (handles “1 bath”, “1.5 baths”, “Half-bath”, “shared half-bath”, “2 bathrooms”, etc.):
 
pattern = re.compile(r"""(?ix)
    ^\s*                                                          # allow leading spaces
    (?:                                                           # start alternation
        (?P<num>\d+(?:\.\d+)?)\s*                                 # 1, 1.5, 0, 15.5
        (?: (?:private|shared)\s+ )?                              # optional qualifier after number
        bath(?:room)?s?                                           # bath / bathroom(s)
      |
        (?: (?:private|shared)\s+ )?                              # optional qualifier before 'half'
        half[-\s]?bath(?:room)?s?                                 # half-bath / half bathroom(s)
    )
    \s*$                                                          # allow trailing spaces
""")

def parse_bath(text: str):
    """Parse bath count from bathrooms_text (your listed formats)."""
    if not isinstance(text, str) or not text.strip():
        return np.nan
    m = pattern.match(text.strip())
    if not m:
        return np.nan
    if m.group('num') is not None:        # numeric forms: '1 bath', '1.5 shared baths', '0 baths'
        return float(m.group('num'))
    else:                                  # 'half-bath', 'shared half-bath', 'private half-bath'
        return 0.5

# fill only missing bathrooms from parsed bathrooms_text
df_Rome["bathrooms"] = df_Rome["bathrooms"].fillna(df_Rome["bathrooms_text"].apply(parse_bath))
In [165]:
# Missing values
df_Rome["bathrooms"].isna().sum()
Out[165]:
np.int64(21)
In [166]:
df_Rome["bathrooms"].unique()
Out[166]:
array([ 1.5,  1. ,  2. ,  4. ,  3. ,  5. ,  6.5,  2.5,  4.5,  0. ,  0.5,
       12.5,  5.5,  6. ,  7. ,  8. ,  3.5,  nan,  7.5, 12. ,  9.5, 21.5,
       10. ,  9. , 24. , 11. ])
In [167]:
df_bathrooms_null_bathrooms_text_filled = df_Rome[df_Rome["bathrooms"].isnull()][["bathrooms", "bathrooms_text"]]

len(df_bathrooms_null_bathrooms_text_filled)
Out[167]:
21
In [168]:
df_bathrooms_null_bathrooms_text_filled.index.values
Out[168]:
array([ 2229,  4822,  4854,  5303,  5482,  7216,  7636,  8374,  8716,
        8768,  8929, 18185, 20526, 20528, 21166, 22192, 32461, 33110,
       33699, 33980, 36113])
In [169]:
df_Rome = df_Rome.drop("bathrooms_text", axis=1, errors="ignore")

Spalte "host_neighbourhood" und "host_location" untersuchen¶

Fehlende Werte: ca. 66 % bzw. 25 %

Ziel: Die Spalte host_neighbourhood und host_location aufbereiten.

Vorgehen:

  • Daten: host_neighbourhood mit neighbourhood_cleansed vergleichen.
  • Daten: host_neighbourhood droppen.
  • Daten: host_location mit latitude und longitude vergleichen.
  • Daten: host_location droppen.
In [170]:
# Investigate: no description in the Data Dictionary
df_Rome["host_neighbourhood"].unique()[:10]
Out[170]:
array(['Testaccio', 'Monti', 'Flaminio', 'Prati', 'Trastevere',
       'San Lorenzo', 'Esquilino', nan, 'Pigneto', 'Aurelio'],
      dtype=object)
In [171]:
df_Rome[df_Rome["host_neighbourhood"].notnull()][["host_neighbourhood", "neighbourhood_cleansed"]].head(10)
Out[171]:
host_neighbourhood neighbourhood_cleansed
0 Testaccio VIII Appia Antica
1 Monti I Centro Storico
2 Monti I Centro Storico
3 Flaminio II Parioli/Nomentano
4 Prati I Centro Storico
5 Prati I Centro Storico
6 Trastevere I Centro Storico
7 Trastevere I Centro Storico
8 San Lorenzo II Parioli/Nomentano
9 Monti I Centro Storico
In [172]:
len(df_Rome[df_Rome["host_neighbourhood"] == df_Rome["neighbourhood_cleansed"]][["host_neighbourhood", "neighbourhood_cleansed"]])
Out[172]:
0
In [173]:
len(df_Rome[df_Rome["host_neighbourhood"] != df_Rome["neighbourhood_cleansed"]][["host_neighbourhood", "neighbourhood_cleansed"]])
Out[173]:
36309
In [174]:
# Drop "host_neighbourhood". The reason: we have a column "neighbourhood_cleansed" with correct location
df_Rome = df_Rome.drop("host_neighbourhood", axis=1, errors="ignore")
In [175]:
# Description in the Data Dictionary: The host's self reported location
df_Rome["host_location"].unique()[:10]
Out[175]:
array(['Rome, Italy', nan, 'Milan, Italy', 'Vienna, Austria',
       'Cerveteri, Italy', 'Alella, Spain', 'Ladispoli, Italy',
       'London, United Kingdom', 'Giulianova, Italy', 'United Kingdom'],
      dtype=object)
In [176]:
# Compare the column "host_location" with "latitude" and "longitude"
df_Rome[df_Rome["host_location"].notnull()][["host_location", "latitude", "longitude"]].head(10)
Out[176]:
host_location latitude longitude
0 Rome, Italy 41.871360 12.482150
1 Rome, Italy 41.895000 12.491170
2 Rome, Italy 41.895447 12.491181
3 Rome, Italy 41.925820 12.469280
4 Rome, Italy 41.907310 12.453500
5 Rome, Italy 41.909200 12.453440
6 Rome, Italy 41.889920 12.468230
7 Rome, Italy 41.889920 12.468230
8 Rome, Italy 41.897860 12.514600
9 Rome, Italy 41.888360 12.501480
In [177]:
# Investigate the column "host_location"
# Convert type df_Rome["host_location"] to string
df_Rome["host_location"] = df_Rome["host_location"].astype("string") 
In [178]:
# Drop "host_location" - a lot of redundant data
df_Rome = df_Rome.drop("host_location", axis=1, errors="ignore")

Spalte "has_availability" untersuchen und fehlende Werte aus "availability_365" füllen¶

Fehlende Werte: 1.6 %

Ziel: Die boolesche Spalte has_availability vervollständigen, indem wir fehlende Werte aus der Spalte availability_365 ableiten.

Vorgehen:

  1. Füllen: Nur dort ergänzen, wo has_availability NaN ist – existierende boolesche Werte werden nicht überschrieben.
  2. Check: Wie viele Zeilen gefüllt, wie viele fehlen noch.
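Das Füll-Muster aus Schritt 1 lässt sich an einer Mini-Skizze zeigen (Spaltennamen wie im Datensatz, Werte frei erfunden): fillna ersetzt nur fehlende Zellen, vorhandene boolesche Werte bleiben unberührt.

```python
import pandas as pd

# Mini-Skizze: "has_availability" nur dort füllen, wo der Wert fehlt
df = pd.DataFrame({
    "has_availability": pd.array([True, None, None], dtype="boolean"),
    "availability_365": [120, 0, 40],
})

derived = df["availability_365"].gt(0)  # True, wenn mind. 1 Tag verfügbar

# fillna ersetzt nur <NA>; vorhandene boolesche Werte bleiben unverändert
df["has_availability"] = df["has_availability"].fillna(derived).astype("boolean")

print(df["has_availability"].tolist())  # [True, False, True]
```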
In [179]:
print("Missing \"has_availability\" before:", df_Rome["has_availability"].isna().sum())
Missing "has_availability" before: 591
In [180]:
# Investigate "has_availability "
df_Rome["has_availability"].unique()
# [True, <NA>]

df_Rome[df_Rome["has_availability"].isnull()][["has_availability", "availability_30", "availability_60", "availability_90", "availability_365"]]
Out[180]:
has_availability availability_30 availability_60 availability_90 availability_365
246 <NA> 0 0 0 0
248 <NA> 0 0 0 0
321 <NA> 0 0 0 0
354 <NA> 0 0 0 0
386 <NA> 0 0 0 0
... ... ... ... ... ...
36091 <NA> 15 16 26 281
36103 <NA> 30 60 90 365
36263 <NA> 19 49 76 351
36269 <NA> 29 59 89 364
36284 <NA> 25 55 85 175

591 rows × 5 columns

In [181]:
# Investigate "has_availability "
mask = (
    df_Rome["availability_30"].eq(0) &
    df_Rome["availability_60"].eq(0) &
    df_Rome["availability_90"].eq(0) &
    df_Rome["availability_365"].eq(0) &
    df_Rome["has_availability"].isna()
)

count = mask.sum()          # number of rows
rows  = df_Rome.loc[mask]        # the matching rows
count
rows[["has_availability","availability_30","availability_60","availability_90","availability_365"]]
Out[181]:
has_availability availability_30 availability_60 availability_90 availability_365
246 <NA> 0 0 0 0
248 <NA> 0 0 0 0
321 <NA> 0 0 0 0
354 <NA> 0 0 0 0
386 <NA> 0 0 0 0
... ... ... ... ... ...
32065 <NA> 0 0 0 0
32220 <NA> 0 0 0 0
32236 <NA> 0 0 0 0
33791 <NA> 0 0 0 0
35726 <NA> 0 0 0 0

517 rows × 5 columns

In [182]:
mask = (
    df_Rome["has_availability"].isna() |
    df_Rome["availability_30"].eq(0) &
    df_Rome["availability_60"].eq(0) &
    df_Rome["availability_90"].eq(0) &
    df_Rome["availability_365"].eq(0)
)

count = mask.sum()          # number of rows
rows  = df_Rome.loc[mask]        # the matching rows
count
rows[["has_availability","availability_30","availability_60","availability_90","availability_365"]]
Out[182]:
has_availability availability_30 availability_60 availability_90 availability_365
14 True 0 0 0 0
33 True 0 0 0 0
43 True 0 0 0 0
82 True 0 0 0 0
103 True 0 0 0 0
... ... ... ... ... ...
36103 <NA> 30 60 90 365
36155 True 0 0 0 0
36263 <NA> 19 49 76 351
36269 <NA> 29 59 89 364
36284 <NA> 25 55 85 175

2229 rows × 5 columns

In [183]:
# Clean df_Rome["has_availability"]
# Minimal (use 365 days only)
# Fill the 1.6% missing in "has_availability": True if there is at least 1 available day in the next 365, else False
df_Rome["has_availability"] = (
    df_Rome["has_availability"]
      .fillna(df_Rome["availability_365"].gt(0))
      .astype("boolean")
)
In [184]:
print("Missing \"has_availability\" after:", df_Rome["has_availability"].isna().sum())
Missing "has_availability" after: 0
In [185]:
missing = missing_values(df_Rome)
missing[missing["Total_perc"] > 0].Total_perc
Out[185]:
host_since                       0.057837
host_acceptance_rate             9.837781
host_verifications               0.057837
host_identity_verified           0.057837
neighbourhood_group_cleansed    51.372938
bathrooms                        0.057837
bedrooms                         2.139965
beds                            10.261919
minimum_minimum_nights           0.027541
maximum_minimum_nights           0.027541
minimum_maximum_nights           0.027541
maximum_maximum_nights           0.027541
estimated_revenue_l365d         10.239885
first_review                    13.181305
last_review                     13.181305
review_scores_rating            13.181305
review_scores_accuracy          13.200584
review_scores_cleanliness       13.197830
review_scores_checkin           13.200584
review_scores_communication     13.197830
review_scores_location          13.203338
review_scores_value             13.200584
license                          8.771930
reviews_per_month               13.181305
Name: Total_perc, dtype: float64

Spalten "estimated_revenue_l365d", "host_acceptance_rate" und "beds" untersuchen¶

Fehlende Werte > 10%

Ziel: Die Spalten estimated_revenue_l365d, host_acceptance_rate und beds aufbereiten.

Vorgehen:

  • Daten: estimated_revenue_l365d droppen.
  • Daten: host_acceptance_rate droppen.
  • Daten: beds droppen.
In [187]:
# Investigate: not in the dictionary
len(df_Rome["estimated_revenue_l365d"].unique())
Out[187]:
5238
In [188]:
# Drop "estimated_revenue_l365d"
df_Rome = df_Rome.drop("estimated_revenue_l365d", axis=1, errors="ignore")
In [189]:
# Investigate 
# mask = (
#     df_Rome["beds"].eq(0) &
#     df_Rome["bedrooms"].eq(0)
# )

# count = mask.sum()  
# count
# rows  = df_Rome.loc[mask] 
# rows[["beds", "bedrooms"]]
# 60

# mask = (
#     df_Rome["beds"].isnull() &
#     df_Rome["bedrooms"].notnull()
# )
# mask = (
#     df_Rome["beds"].notnull() &
#     df_Rome["bedrooms"].notnull() &
#     df_Rome["beds"] > df_Rome["bedrooms"]
# )
# rows  = df_Rome.loc[mask]
# len(rows)*100/len(df) # 24 %

# Drop "beds": a lot of missing data (> 10%)
df_Rome = df_Rome.drop("beds", axis=1)
In [190]:
# Not needed for the analysis: drop "host_acceptance_rate"
df_Rome = df_Rome.drop("host_acceptance_rate", axis=1)

Spalten "first_review" und "last_review" bereinigen und fehlende Werte auffüllen¶

Fehlende Werte > 13%

Ziel: Die Datumsfelder first_review und last_review in datetime64[ns] konvertieren, Ausreißer (Zukunft/Extremwerte) entfernen und fehlende Werte mit dem globalen Median-Datum pro Spalte füllen.

Vorgehen:

  • Datumsformat: in datetime64[ns] umwandeln.
  • Offensichtliche Ausreißer:
    • Zukunftsdaten (> heute) -> NaT.
    • Sehr späte Werte (z. B. > 99,5%-Quantil) -> NaT.
    • Sehr frühe Werte (< 0,5%-Quantil) -> NaT.
  • Konsistenz: sicherstellen, dass first_review ≤ last_review; sonst inkonsistente Werte auf NaT.
  • Check vor Imputation: Verteilungen/Anteil Missing prüfen.
  • Imputation: globaler Median je Spalte (first_review, last_review) zum Füllen von NaT.
  • Check nach Imputation: Verteilungen/Kennzahlen erneut prüfen; Anteil imputierter Werte dokumentieren.
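Das Vorgehen (Ausreißer -> NaT, dann Median-Imputation) als kleine Skizze mit erfundenen Datumswerten; statt pd.Timestamp.today() wird hier ein fester Stichtag verwendet, damit das Beispiel reproduzierbar bleibt.

```python
import pandas as pd

# Mini-Skizze: Zukunftsdatum -> NaT, dann Median-Imputation
s = pd.to_datetime(pd.Series(["2019-05-01", "2021-07-15", "2099-01-01", None]), errors="coerce")

cutoff = pd.Timestamp("2025-09-18")   # fester Stichtag statt pd.Timestamp.today() (reproduzierbar)
s = s.mask(s > cutoff)                # Zukunftsdaten -> NaT

med = s.median()                      # Median der verbleibenden gültigen Daten
s = s.fillna(med)

print(s.isna().sum())  # 0
```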
In [193]:
# Check missing values
date_cols = ["first_review", "last_review"]

for c in date_cols:
    print("Missing values before imputation:", df_Rome[c].isna().sum())
Missing values before imputation: 4786
Missing values before imputation: 4786
In [194]:
# Datums-Extremwerte (einfach) -> NaT, dann mit Median füllen
date_cols = ["first_review", "last_review"]

for c in date_cols:
    s = pd.to_datetime(df_Rome[c], errors="coerce")

    # Offensichtliche Ausreißer -> NaT
    s = s.mask(s > pd.Timestamp.today())      # Zukunftsdaten
    upper_cap = s.quantile(0.995)             # sehr späte Ausreißer (oberes 99,5%-Quantil)
    s = s.mask(s > upper_cap)

    # Sehr frühe Ausreißer
    lower_cap = s.quantile(0.005)
    s = s.mask(s < lower_cap)

    df_Rome[c] = s  # bereinigte Spalte zurückschreiben, sonst geht die Bereinigung verloren
In [196]:
# Konsistenz prüfen: first_review <= last_review, sonst konservativ auf NaT setzen
both = df_Rome["first_review"].notna() & df_Rome["last_review"].notna()
bad = both & (df_Rome["first_review"] > df_Rome["last_review"])
if bad.any():
    df_Rome.loc[bad, ["first_review", "last_review"]] = pd.NaT        
In [197]:
# Missing mit globalem Median-Datum füllen
for c in ["first_review", "last_review"]:
    df_Rome[c] = pd.to_datetime(df_Rome[c], errors="coerce")  # -> datetime64[ns]
    med = df_Rome[c].median()                            # Timestamp
    df_Rome[c] = df_Rome[c].fillna(med)                       # fill NaT with median
In [198]:
# Check missing values
for c in date_cols:
    print("Missing values after imputation:", df_Rome[c].isna().sum())
Missing values after imputation: 0
Missing values after imputation: 0
In [199]:
missing = missing_values(df_Rome)
missing[missing["Total_perc"] > 0].Total_perc
Out[199]:
host_since                       0.057837
host_verifications               0.057837
host_identity_verified           0.057837
neighbourhood_group_cleansed    51.372938
bathrooms                        0.057837
bedrooms                         2.139965
minimum_minimum_nights           0.027541
maximum_minimum_nights           0.027541
minimum_maximum_nights           0.027541
maximum_maximum_nights           0.027541
review_scores_rating            13.181305
review_scores_accuracy          13.200584
review_scores_cleanliness       13.197830
review_scores_checkin           13.200584
review_scores_communication     13.197830
review_scores_location          13.203338
review_scores_value             13.200584
license                          8.771930
reviews_per_month               13.181305
Name: Total_perc, dtype: float64

Spalten "review_*" und "reviews_*" bereinigen und fehlende Werte auffüllen¶

Fehlende Werte > 13%

Ziel: Alle numerischen Review-Spalten (z. B. review_scores_rating, review_scores_cleanliness, reviews_per_month) konsistent säubern, Ausreißer entfernen und fehlende Werte robust mit dem spaltenspezifischen Median füllen.

Auswahl der Spalten:

  • review_cols = [c for c in df_Rome.columns if c.lower().startswith(("review_", "reviews_"))]

Vorgehen:

  • Typkonvertierung: ausgewählte Spalten in float konvertieren.
  • Offensichtliche Fehler: negative Werte -> NaN.
  • Ausreißer (hoch): Werte > 99,5%-Quantil pro Spalte -> NaN.
  • Ausreißer (niedrig): Werte < 0,5%-Quantil -> NaN.
  • Check vor Imputation: Verteilungen/Anteil Missing je Spalte prüfen.
  • Imputation: fehlende Werte spaltenweise mit dem Median füllen.
  • Check nach Imputation: Verteilungen und Anteil imputierter Werte dokumentieren; sicherstellen, dass keine negativen/außerhalb-Bereich-Werte verbleiben.
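Das Schema (negative Werte und Quantil-Ausreißer -> NaN, dann Median) an einer kleinen, frei erfundenen Score-Reihe; bei so wenigen Werten kappen die 0,5%/99,5%-Quantile auch Minimum und Maximum mit, im echten Datensatz treffen sie nur echte Extremwerte.

```python
import pandas as pd

# Mini-Skizze: Fehler/Ausreißer -> NaN, dann Median-Imputation
s = pd.to_numeric(pd.Series([4.8, 4.9, 4.7, 4.6, 99.0, None]), errors="coerce")

s = s.mask(s < 0)                    # negative Scores -> NaN (hier keine)
s = s.mask(s > s.quantile(0.995))    # sehr hohe Spitzen kappen (trifft 99.0)
s = s.mask(s < s.quantile(0.005))    # sehr niedrige Spitzen kappen

s = s.fillna(s.median())             # fehlende Werte mit Median der Restwerte füllen
print(s.isna().sum())  # 0
```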
In [200]:
# Check missing values
review_cols = [c for c in df_Rome.columns if c.lower().startswith(("review_", "reviews_"))]

for c in review_cols:
    print("Missing values before imputation:", df_Rome[c].isna().sum())
Missing values before imputation: 4786
Missing values before imputation: 4793
Missing values before imputation: 4792
Missing values before imputation: 4793
Missing values before imputation: 4792
Missing values before imputation: 4794
Missing values before imputation: 4793
Missing values before imputation: 4786
In [ ]:
# Numerische Extremwerte -> NaN, dann mit Median füllen

for c in review_cols:
    # zu float konvertieren
    s = pd.to_numeric(df_Rome[c], errors="coerce")

    # offensichtliche Fehler/Ausreißer -> NaN
    s = s.mask(s < 0)                 # negative Werte nicht zulässig
    upper_cap = s.quantile(0.995)     # sehr hohe Spitzen kappen
    s = s.mask(s > upper_cap)

    # Sehr niedrige Ausreißer kappen
    lower_cap = s.quantile(0.005)
    s = s.mask(s < lower_cap)

    # fehlende Werte mit Median füllen
    df_Rome[c] = s.fillna(s.median())
In [201]:
# Check missing values
for c in review_cols:
    print("Missing values after imputation:", df_Rome[c].isna().sum())
Missing values after imputation: 4786
Missing values after imputation: 4793
Missing values after imputation: 4792
Missing values after imputation: 4793
Missing values after imputation: 4792
Missing values after imputation: 4794
Missing values after imputation: 4793
Missing values after imputation: 4786

Spalten mit fehlenden Werten bereinigen und fehlende Werte auffüllen¶

  • name: Leerzeichen trimmen; fehlend -> „Unbenannt“.
  • host_since: Zukunft und >99,5%-Quantil -> NaT; fehlend -> Median-Datum.
  • host_verifications: fehlend -> "[]"; trimmen.
  • host_identity_verified: fehlend -> False.
  • bathrooms: gruppierter Median nach room_type × accommodates, sonst global; ≥0 clippen, auf 0,5 runden.
  • bedrooms: gruppierter Median, sonst global; ≥0 clippen.
  • license: fehlende Werte (8.8 %) mit Platzhalter „Unbenannt“ füllen.
  • Report: verbleibende Missing je Spalte ausgeben.
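Die gruppierte Median-Imputation mit globalem Fallback (wie bei bathrooms und bedrooms) als Mini-Skizze mit erfundenen Werten; die Gruppe „Hotel room“ hat hier bewusst keinen gültigen Wert, damit der Fallback greift.

```python
import numpy as np
import pandas as pd

# Mini-Skizze: gruppierter Median (room_type) mit globalem Fallback
df = pd.DataFrame({
    "room_type": ["Entire home/apt", "Entire home/apt", "Private room", "Private room", "Hotel room"],
    "bathrooms": [1.0, None, 1.0, 1.0, None],
})

# 1) Median je Gruppe, nur in fehlende Zellen schreiben
med_g = df.groupby("room_type")["bathrooms"].transform("median")
need = df["bathrooms"].isna()
df.loc[need, "bathrooms"] = med_g[need]

# 2) Fallback: Gruppen ohne gültigen Wert (hier "Hotel room") global füllen
df["bathrooms"] = df["bathrooms"].fillna(df["bathrooms"].median())

# 3) Aufräumen: >= 0 clippen und auf 0,5er-Schritte runden
df["bathrooms"] = np.round(df["bathrooms"].clip(lower=0) * 2) / 2

print(df["bathrooms"].tolist())  # [1.0, 1.0, 1.0, 1.0, 1.0]
```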
In [202]:
# Spalten mit Fehlende Werte
missing = missing_values(df_Rome)
missing[missing["Total_perc"] > 0].Total_perc
Out[202]:
host_since                       0.057837
host_verifications               0.057837
host_identity_verified           0.057837
neighbourhood_group_cleansed    51.372938
bathrooms                        0.057837
bedrooms                         2.139965
minimum_minimum_nights           0.027541
maximum_minimum_nights           0.027541
minimum_maximum_nights           0.027541
maximum_maximum_nights           0.027541
review_scores_rating            13.181305
review_scores_accuracy          13.200584
review_scores_cleanliness       13.197830
review_scores_checkin           13.200584
review_scores_communication     13.197830
review_scores_location          13.203338
review_scores_value             13.200584
license                          8.771930
reviews_per_month               13.181305
Name: Total_perc, dtype: float64
In [203]:
# Spalte: "name"
df_Rome["name"] = df_Rome["name"].str.strip().fillna("Unbenannt")
In [204]:
# Spalte: "license" (Missing-Rate ca. 8.8 %): fehlende Werte mit Platzhalter füllen
df_Rome["license"] = df_Rome["license"].str.strip().fillna("Unbenannt")
In [205]:
# Spalte: "host_since"
s = df_Rome["host_since"]
s = s.mask(s > pd.Timestamp.today())              # Zukunft → NaT
upper_cap = s.quantile(0.995)                     # sehr späte Ausreißer
s = s.mask(s > upper_cap)
df_Rome["host_since"] = s.fillna(s.median())           # globaler Median pro Spalte
In [206]:
# Spalte: "host_verifications"
df_Rome["host_verifications"] = df_Rome["host_verifications"].fillna("[]").str.strip()
In [207]:
# Spalte: "host_identity_verified"
df_Rome["host_identity_verified"] = df_Rome["host_identity_verified"].fillna(False)
In [208]:
# Spalte: "bathrooms"
grp_cols = [c for c in ["room_type","accommodates"] if c in df_Rome.columns]
if grp_cols:
    med_g = df_Rome.groupby(grp_cols)["bathrooms"].transform("median")
    need = df_Rome["bathrooms"].isna()
    df_Rome.loc[need, "bathrooms"] = med_g[need]
# Fallback global
df_Rome["bathrooms"] = df_Rome["bathrooms"].fillna(df_Rome["bathrooms"].median())
 # Aufräumen
df_Rome["bathrooms"] = df_Rome["bathrooms"].clip(lower=0)
df_Rome["bathrooms"] = (np.round(df_Rome["bathrooms"] * 2) / 2)  # 0.5-Schritte
In [209]:
# Spalte: "bedrooms"
grp_cols = [c for c in ["room_type","accommodates"] if c in df_Rome.columns]
if grp_cols:
    med_g = df_Rome.groupby(grp_cols)["bedrooms"].transform("median")
    need = df_Rome["bedrooms"].isna()
    df_Rome.loc[need, "bedrooms"] = med_g[need]
df_Rome["bedrooms"] = df_Rome["bedrooms"].fillna(df_Rome["bedrooms"].median())
df_Rome["bedrooms"] = df_Rome["bedrooms"].clip(lower=0)
In [210]:
# Check
cols_report = ["name","host_since","host_verifications","host_identity_verified","bathrooms","bedrooms","license"]
present = [c for c in cols_report if c in df_Rome.columns]
print("Bereinigte Spalten:", present)
print(df_Rome[present].isna().sum().sort_values(ascending=False))
Bereinigte Spalten: ['name', 'host_since', 'host_verifications', 'host_identity_verified', 'bathrooms', 'bedrooms', 'license']
name                      0
host_since                0
host_verifications        0
host_identity_verified    0
bathrooms                 0
bedrooms                  0
license                   0
dtype: int64
In [211]:
missing = missing_values(df_Rome)
missing[missing["Total_perc"] > 0].Total_perc
Out[211]:
neighbourhood_group_cleansed    51.372938
minimum_minimum_nights           0.027541
maximum_minimum_nights           0.027541
minimum_maximum_nights           0.027541
maximum_maximum_nights           0.027541
review_scores_rating            13.181305
review_scores_accuracy          13.200584
review_scores_cleanliness       13.197830
review_scores_checkin           13.200584
review_scores_communication     13.197830
review_scores_location          13.203338
review_scores_value             13.200584
reviews_per_month               13.181305
Name: Total_perc, dtype: float64

Feature Engineering¶

Host since year erstellen¶

Die Vorgehensweise ist hier wie folgt: Spalte "host_since_year" erstellen und Jahr aus "host_since" extrahieren

Ziel: Jahr für einfache Gruppierungen/Trends ableiten.
Voraussetzung: "host_since" ist bereits datetime64.

In [212]:
df_Rome["host_since"].head()
Out[212]:
0   2008-09-18
1   2008-10-08
2   2009-10-09
3   2009-03-30
4   2010-01-30
Name: host_since, dtype: datetime64[ns]
In [213]:
# extract year from "host_since"
df_Rome["host_since_year"] = df_Rome["host_since"].dt.year.astype("int64")
In [214]:
df_Rome["host_since_year"].head()
Out[214]:
0    2008
1    2008
2    2009
3    2009
4    2010
Name: host_since_year, dtype: int64
In [215]:
# Erstellen der Sicherung (save dataframe to csv file)
df_Rome.to_csv(f"{file_Rome[:-4]}_clean.csv")

EDA-Plan (Airbnb New York)¶

Univariat (Verteilungen)¶

  • Preis ("price")
    • Verteilung (Histogramm): "price" und "log(price)"
  • Kapazität & Ausstattung
    • "accommodates", "bedrooms", "bathrooms" -> Histogramm und Boxplot
  • Verfügbarkeit
    • "availability_365"
  • Bewertungen
    • "number_of_reviews", "reviews_per_month", Scores "review_scores_*"
  • Kategorien
    • "room_type", "property_type", "neighbourhood_group_cleansed"
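Warum log(1+price)? Die Transformation staucht die rechtsschiefe Preisverteilung und bleibt für price = 0 definiert; eine Mini-Skizze mit frei erfundenen Preisen.

```python
import numpy as np

# Rechtsschiefe Preise: log1p staucht große Werte, log1p(0) = 0 bleibt definiert
prices = np.array([0.0, 50.0, 150.0, 5000.0])
log_prices = np.log1p(prices)
print(np.round(log_prices, 2))  # [0.   3.93 5.02 8.52]
```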

Vorbereitung der Daten¶

Einlesen der Daten

  • Reine Index-Spalte "Unnamed: 0" droppen
In [216]:
file_NY = "data/NY/listings_clean.csv" 
df_NY = pd.read_csv(file_NY, low_memory=False)
df_NY.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36322 entries, 0 to 36321
Data columns (total 50 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Unnamed: 0                    36322 non-null  int64  
 1   id                            36322 non-null  int64  
 2   name                          36322 non-null  object 
 3   host_id                       36322 non-null  int64  
 4   host_since                    36322 non-null  object 
 5   host_verifications            36322 non-null  object 
 6   host_identity_verified        36322 non-null  bool   
 7   neighbourhood_cleansed        36322 non-null  object 
 8   neighbourhood_group_cleansed  36322 non-null  object 
 9   latitude                      36322 non-null  float64
 10  longitude                     36322 non-null  float64
 11  property_type                 36322 non-null  object 
 12  room_type                     36322 non-null  object 
 13  accommodates                  36322 non-null  int64  
 14  bathrooms                     36322 non-null  float64
 15  bedrooms                      36322 non-null  float64
 16  amenities                     36322 non-null  object 
 17  price                         36322 non-null  float64
 18  minimum_nights                36322 non-null  int64  
 19  maximum_nights                36322 non-null  int64  
 20  minimum_minimum_nights        36322 non-null  int64  
 21  maximum_minimum_nights        36322 non-null  int64  
 22  minimum_maximum_nights        36322 non-null  int64  
 23  maximum_maximum_nights        36322 non-null  int64  
 24  minimum_nights_avg_ntm        36322 non-null  float64
 25  maximum_nights_avg_ntm        36322 non-null  float64
 26  has_availability              36322 non-null  bool   
 27  availability_30               36322 non-null  int64  
 28  availability_60               36322 non-null  int64  
 29  availability_90               36322 non-null  int64  
 30  availability_365              36322 non-null  int64  
 31  number_of_reviews             36322 non-null  int64  
 32  number_of_reviews_ltm         36322 non-null  int64  
 33  number_of_reviews_l30d        36322 non-null  int64  
 34  availability_eoy              36322 non-null  int64  
 35  number_of_reviews_ly          36322 non-null  int64  
 36  estimated_occupancy_l365d     36322 non-null  int64  
 37  first_review                  36322 non-null  object 
 38  last_review                   36322 non-null  object 
 39  review_scores_rating          36322 non-null  float64
 40  review_scores_accuracy        36322 non-null  float64
 41  review_scores_cleanliness     36322 non-null  float64
 42  review_scores_checkin         36322 non-null  float64
 43  review_scores_communication   36322 non-null  float64
 44  review_scores_location        36322 non-null  float64
 45  review_scores_value           36322 non-null  float64
 46  license                       36322 non-null  object 
 47  instant_bookable              36322 non-null  bool   
 48  reviews_per_month             36322 non-null  float64
 49  host_since_year               36322 non-null  int64  
dtypes: bool(3), float64(15), int64(21), object(11)
memory usage: 13.1+ MB
In [217]:
# Unnamed: 0 als reine Index-Spalte aus EDA ausschließen
df_NY = df_NY.drop(columns="Unnamed: 0", errors="ignore")
In [218]:
for c in ["host_since", "first_review", "last_review"]:
    df_NY[c] = pd.to_datetime(df_NY[c], errors="coerce")
In [219]:
# Preis (raw & log1p)
fig, axes = plt.subplots(1, 2, figsize=(12,4))
sns.histplot(data=df_NY, x="price", bins=50, ax=axes[0])
axes[0].set_title("Preis (USD/Nacht)")
sns.histplot(data=df_NY, x=np.log1p(df_NY["price"]), bins=50, ax=axes[1])
axes[1].set_title("log(1+Preis)")
plt.show()
No description has been provided for this image
In [220]:
# Kapazität & Ausstattung
for col in ["accommodates","bedrooms","bathrooms"]:
    if col in df_NY.columns:
        fig, ax = plt.subplots(figsize=(6,4))
        sns.histplot(data=df_NY, x=col, bins=40, ax=ax)
        ax.set_title(f"Verteilung: {col}")
        plt.show()
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
In [221]:
# Verfügbarkeit (Beispiel: 365)
fig, ax = plt.subplots(figsize=(6,4))
sns.histplot(data=df_NY, x="availability_365", bins=50, ax=ax)
ax.set_title("Verfügbarkeit (365 Tage)")
plt.tight_layout()
No description has been provided for this image
In [222]:
# Reviews: Anzahl / Rate / Score
for col in ["number_of_reviews", "reviews_per_month", "review_scores_rating"]:
    fig, ax = plt.subplots(figsize=(6,4))
    sns.histplot(data=df_NY, x=col, bins=50, ax=ax)
    ax.set_title(f"Verteilung: {col}")
    plt.tight_layout()
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
In [223]:
# Kategoriale Häufigkeiten
for col in ["room_type","property_type", "neighbourhood_group_cleansed"]:
    fig, ax = plt.subplots(figsize=(7,4))
    order = df_NY[col].value_counts().index[:20]  # Top 20
    sns.countplot(data=df_NY, y=col, order=order, ax=ax)
    ax.set_title(f"Häufigkeiten: {col}")
    plt.show()
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image

Bivariat: Korrelation und Gruppenvergleiche¶

  • Numerisch vs numerisch (Spearman bevorzugt)
    • "price" vs "accommodates", "bathrooms", "bedrooms", "availability_365", "reviews_per_month", "estimated_occupancy_l365d"
    • Visuals: Scatter, Korrelations-Heatmap
  • Kategorial vs numerisch (Gruppenvergleiche)
    • "price" nach "room_type" & "neighbourhood_group_cleansed" -> Pivot, Boxplot
    • "reviews_per_month" nach "neighbourhood_group_cleansed"
  • Kategorial vs kategorial
    • "room_type × neighbourhood_group_cleansed" -> Kreuztabelle, Balkendiagramm, Heatmap
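Warum Spearman bevorzugt? Spearman korreliert Ränge und ist damit robust gegenüber den extremen Preisen im Datensatz; eine kleine Skizze mit frei erfundenen Werten.

```python
import pandas as pd

# Mini-Skizze: Spearman ist rangbasiert und robust gegenüber Ausreißern
df = pd.DataFrame({
    "accommodates": [1, 2, 3, 4, 5],
    "price":        [50, 80, 110, 150, 5000],  # ein extremer Preis
})

pearson  = df["accommodates"].corr(df["price"], method="pearson")
spearman = df["accommodates"].corr(df["price"], method="spearman")

# Beziehung ist perfekt monoton -> Spearman = 1.0; Pearson wird vom Ausreißer gedrückt
print(round(spearman, 3), round(pearson, 3))
```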
In [224]:
# Spearman-Korrelation (numeric)
num_cols = [c for c in df_NY.columns if pd.api.types.is_numeric_dtype(df_NY[c]) and c not in ["id"]]
corr_s = df_NY[num_cols].corr(method="spearman")

plt.figure(figsize=(10,8))
sns.heatmap(corr_s, cmap="coolwarm", vmin=-1, vmax=1, center=0)
plt.title("Spearman-Korrelation (numeric)")
plt.show()
No description has been provided for this image
In [225]:
# Scatter: price vs accommodates (Preise < 2000 USD)
mask = df_NY["price"] < 2000

plt.figure()
plt.scatter(df_NY.loc[mask, "accommodates"], np.log1p(df_NY.loc[mask, "price"]), alpha=0.3, s=8)
plt.title("price vs accommodates")
plt.xlabel("accommodates"); plt.ylabel("log(1+price)")
plt.show()
No description has been provided for this image
In [226]:
# Boxplots: price nach Room Type / Accommodates
mask = df_NY["price"] < 2000

# DataFrame.boxplot(by=...) erzeugt eine eigene Figure; figsize daher direkt übergeben,
# sonst bleibt eine zuvor mit plt.figure() angelegte Figure leer
ax = df_NY[mask].boxplot(column="price", by="room_type", figsize=(6,4))
ax.set_title("Preis nach Room Type"); plt.suptitle("")
ax.set_xlabel("room_type"); ax.set_ylabel("price")
plt.show()

ax = df_NY[mask].boxplot(column="price", by="accommodates", figsize=(6,4))
ax.set_title("Preis nach Accommodates"); plt.suptitle("")
ax.set_xlabel("accommodates"); ax.set_ylabel("price")
plt.show()
No description has been provided for this image
No description has been provided for this image
In [227]:
# Pivot: Median-Preis Borough (administrative unit) × Room Type
pivot = df_NY.pivot_table(
    index="neighbourhood_group_cleansed",
    columns="room_type",
    values="price",
    aggfunc="median"
).round(2)
print("Median-Preis (USD/Nacht): Borough × Room Type")
display(pivot)
Median-Preis (USD/Nacht): Borough × Room Type
room_type Entire home/apt Hotel room Private room Shared room
neighbourhood_group_cleansed
Bronx 125.5 NaN 68.0 37.0
Brooklyn 174.0 40000.0 90.0 40.0
Manhattan 235.0 20659.5 140.5 57.0
Queens 150.0 237.0 85.0 76.0
Staten Island 120.0 NaN 75.0 88.5
In [228]:
# Ordnung nach Häufigkeit (optional, macht Plots übersichtlicher)
row_order = df_NY["room_type"].value_counts().index
col_order = df_NY["neighbourhood_group_cleansed"].value_counts().index

# Counts
ct = pd.crosstab(
    df_NY["room_type"], df_NY["neighbourhood_group_cleansed"]
).reindex(index=row_order, columns=col_order)

# Zeilenprozente (Anteile je room_type)
ct_row = pd.crosstab(
    df_NY["room_type"], df_NY["neighbourhood_group_cleansed"], normalize="index"
).reindex(index=row_order, columns=col_order) * 100

print("Counts:\n", ct, "\n")
print("Row %:\n", ct_row.round(1), "\n")
Counts:
 neighbourhood_group_cleansed  Manhattan  Brooklyn  Queens  Bronx  \
room_type                                                          
Entire home/apt                   10281      6543    1977    454   
Private room                       5421      6679    3326    737   
Hotel room                          307        19       9      0   
Shared room                          72        88      43      1   

neighbourhood_group_cleansed  Staten Island  
room_type                                    
Entire home/apt                         180  
Private room                            183  
Hotel room                                0  
Shared room                               2   

Row %:
 neighbourhood_group_cleansed  Manhattan  Brooklyn  Queens  Bronx  \
room_type                                                          
Entire home/apt                    52.9      33.7    10.2    2.3   
Private room                       33.2      40.9    20.3    4.5   
Hotel room                         91.6       5.7     2.7    0.0   
Shared room                        35.0      42.7    20.9    0.5   

neighbourhood_group_cleansed  Staten Island  
room_type                                    
Entire home/apt                         0.9  
Private room                            1.1  
Hotel room                              0.0  
Shared room                             1.0   

In [229]:
# Heatmap (Zeilenprozente)
plt.figure(figsize=(7,4))
sns.heatmap(ct_row, annot=True, fmt=".1f", cmap="Blues")
plt.title("room_type × Borough — Row %")
plt.xlabel("neighbourhood_group_cleansed"); plt.ylabel("room_type")
plt.tight_layout()
plt.show()
No description has been provided for this image
In [230]:
# Gestapeltes Balkendiagramm (Zeilenprozente)
ax = ct_row.plot(kind="bar", stacked=True, figsize=(7,4))
ax.set_ylabel("Anteil (%)"); ax.set_title("room_type × Borough — Row % (gestapelt)")
plt.legend(title="neighbourhood_group_cleansed", bbox_to_anchor=(1.02, 1), loc="upper left")
plt.tight_layout()
plt.show()
No description has been provided for this image

Multivariat¶

  • Pairplot auf Variablen: "price", "accommodates", "bedrooms", "bathrooms", "reviews_per_month"
  • Preis-Modell (deskriptiv, nicht prädiktiv)
    • OLS-Modell (statsmodels): "log(price) ~ accommodates + bedrooms + bathrooms + room_type + neighbourhood_group_cleansed + instant_bookable + review_scores_rating"
In [231]:
# Pairplot auf ausgewählten Variablen (Downsampling für Performance)
cols_pp = [c for c in ["price","accommodates","bedrooms","bathrooms","reviews_per_month"] if c in df_NY.columns]
sample_pp = df_NY[cols_pp + (["room_type"] if "room_type" in df_NY.columns else [])].sample(min(10000, len(df_NY)), random_state=42)
sns.pairplot(sample_pp, hue="room_type" if "room_type" in sample_pp.columns else None, diag_kind="hist")
plt.suptitle("Pairplot: ausgewählte Variablen", y=1.02)
Out[231]:
Text(0.5, 1.02, 'Pairplot: ausgewählte Variablen')
No description has been provided for this image
In [232]:
#  Multivariat: einfache Modelle (statsmodels)
# OLS: log(price) ~ Prädiktoren
df_NY_mod = df_NY.copy()
df_NY_mod = df_NY_mod[df_NY_mod["price"] > 0].copy()
df_NY_mod["log_price"] = np.log(df_NY_mod["price"])

# Formel
formula_ols = "log_price ~ accommodates + bedrooms + bathrooms + C(room_type) + C(neighbourhood_group_cleansed) + instant_bookable + review_scores_rating"
model_ols = smf.ols(formula=formula_ols, data=df_NY_mod).fit()
print(model_ols.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:              log_price   R-squared:                       0.654
Model:                            OLS   Adj. R-squared:                  0.654
Method:                 Least Squares   F-statistic:                     5710.
Date:                Wed, 17 Sep 2025   Prob (F-statistic):               0.00
Time:                        12:35:04   Log-Likelihood:                -24862.
No. Observations:               36322   AIC:                         4.975e+04
Df Residuals:                   36309   BIC:                         4.986e+04
Df Model:                          12                                         
Covariance Type:            nonrobust                                         
====================================================================================================================
                                                       coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------------------------------
Intercept                                            3.6210      0.043     85.168      0.000       3.538       3.704
C(room_type)[T.Hotel room]                           2.7949      0.027    103.640      0.000       2.742       2.848
C(room_type)[T.Private room]                        -0.4671      0.006    -79.693      0.000      -0.479      -0.456
C(room_type)[T.Shared room]                         -0.8188      0.034    -24.240      0.000      -0.885      -0.753
C(neighbourhood_group_cleansed)[T.Brooklyn]          0.2477      0.015     17.034      0.000       0.219       0.276
C(neighbourhood_group_cleansed)[T.Manhattan]         0.6583      0.014     45.418      0.000       0.630       0.687
C(neighbourhood_group_cleansed)[T.Queens]            0.1077      0.015      7.003      0.000       0.078       0.138
C(neighbourhood_group_cleansed)[T.Staten Island]     0.0090      0.029      0.313      0.754      -0.047       0.065
instant_bookable[T.True]                             0.2456      0.007     37.648      0.000       0.233       0.258
accommodates                                         0.1544      0.002     85.544      0.000       0.151       0.158
bedrooms                                            -0.0196      0.004     -5.381      0.000      -0.027      -0.012
bathrooms                                            0.0726      0.006     12.646      0.000       0.061       0.084
review_scores_rating                                 0.1425      0.008     17.351      0.000       0.126       0.159
==============================================================================
Omnibus:                     7981.767   Durbin-Watson:                   1.683
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            91795.925
Skew:                           0.729   Prob(JB):                         0.00
Kurtosis:                      10.650   Cond. No.                         107.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
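Sofern die Zielgröße des Modells log-transformiert ist (die Größenordnung des Intercepts spricht dafür; das ist hier eine Annahme, die im Notebook zu prüfen wäre), lassen sich die Koeffizienten als multiplikative Preiseffekte lesen. Eine kleine Rechenskizze:

```python
import math

# Annahme: abhängige Variable ist log(price). Dann wirkt ein Koeffizient b
# multiplikativ mit dem Faktor exp(b) auf den Preis.
b_manhattan = 0.6583   # C(neighbourhood_group_cleansed)[T.Manhattan]
b_private = -0.4671    # C(room_type)[T.Private room]

faktor_manhattan = math.exp(b_manhattan)  # ggü. Referenz-Borough (Bronx)
faktor_private = math.exp(b_private)      # ggü. Referenz "Entire home/apt"

print(f"Manhattan:    Faktor {faktor_manhattan:.2f}")
print(f"Private room: Faktor {faktor_private:.2f}")
```

Manhattan-Listings liegen demnach (unter dieser Annahme) rund doppelt so hoch im Preis wie die Referenzkategorie Bronx, Privatzimmer deutlich unter ganzen Wohnungen.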

Zeitliche Charakteristika¶

  • Host-Kohorten
    • "host_since_year" -> Entwicklung der Host-Neuzugänge über die Jahre
  • Aktualität
    • "reviews_per_month" -> Aktivität nach Borough: Median reviews_per_month
In [233]:
# Hosts pro Jahr (Kohorten)
coh = df_NY.groupby("host_since_year")["host_id"].nunique().reset_index()
sns.barplot(data=coh, x="host_since_year", y="host_id")
plt.title("Neue Hosts pro Jahr")
plt.xlabel("Jahr")
plt.ylabel("Anzahl Hosts")
plt.xticks(rotation=45)
plt.show()
In [234]:
# Aktivität nach Borough: Median reviews_per_month
rpm = df_NY.groupby("neighbourhood_group_cleansed")["reviews_per_month"].median().reset_index()
plt.figure(figsize=(7,3.5))
sns.barplot(data=rpm, x="neighbourhood_group_cleansed", y="reviews_per_month")
plt.title("Median Reviews/Monat nach Borough")
plt.show()

Visualisierung der Listings (Plotly Express)¶

Geo-Visualisierung der Listings New York¶

Ziel: New York Airbnb-Listings sichtbar machen – nach Borough, Preis, Dichte und Objekttyp.

Datenbasis & Kartenstil

  • Maskenbildung: Einmalige mask auf price < 1500.
  • Zur Performance wurde ein Sample des Datensatzes (sample) verwendet.
  • Kartenzentrum: Median von Latitude/Longitude; Stil: carto-positron.
  • Hover-Infos: name, neighbourhood_cleansed, room_type, price (bzw. property_type, availability_365).

Visualisierungen

  • Punktekarte – eingefärbt nach Borough (neighbourhood_group_cleansed)
    -> Zeigt die räumliche Verteilung der Listings über die Boroughs und lokale Häufungen.
  • Punktekarte – eingefärbt nach Preis (kontinuierlich)
    -> Hebt Preis-Hotspots und günstigere Zonen hervor; erleichtert visuelle Outlier-Erkennung.
  • Dichtekarte (Heatmap)
    -> Visualisiert Listing-Cluster unabhängig von der Farbe. Der Parameter radius steuert die Glättung (größer = glatter).
  • Punktekarte – eingefärbt nach Objekttyp (property_type)
    -> Zeigt den räumlichen Mix der Unterkunftsarten und mögliche Typ-Cluster.

Interpretationshinweise

  • Bei Punktkarten können sich Marker überlagern; die Dichtekarte ergänzt hier den Überblick.
  • Farbskalen unterstützen das Erkennen von Hotspots und Strukturunterschieden zwischen Stadtteilen.
  • Für belastbare Schlüsse mit Aggregaten (z. B. Median-Preis je Borough) gegenprüfen.
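Die im letzten Punkt genannte Gegenprüfung mit Aggregaten lässt sich direkt umsetzen. Eine kleine Skizze, hier mit einem Dummy-DataFrame; im Notebook stattdessen df_NY verwenden:

```python
import pandas as pd

# Beispiel-Skizze: Median-Preis je Borough als Gegenprüfung zu den Karten.
# Dummy-Daten nur zur Illustration; die Spaltennamen entsprechen df_NY.
df = pd.DataFrame({
    "neighbourhood_group_cleansed": ["Manhattan", "Manhattan", "Brooklyn", "Brooklyn", "Queens"],
    "price": [250.0, 180.0, 140.0, 120.0, 95.0],
})

median_price = (
    df.groupby("neighbourhood_group_cleansed")["price"]
      .median()
      .sort_values(ascending=False)
)
print(median_price)
```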
In [235]:
# Map center from data
center = dict(lat=df_NY["latitude"].median(), lon=df_NY["longitude"].median())

mask = df_NY["price"] < 1500
sample = df_NY[mask]

# Downsample for performance (adjust N as you like)
# sample = df_NY.sample(min(80000, len(df_NY)), random_state=42)
In [236]:
# Points map, colored by borough
fig = px.scatter_map(
    sample,
    lat="latitude", lon="longitude",
    color="neighbourhood_group_cleansed",
    hover_data=["name","neighbourhood_cleansed","room_type","price"],
    zoom=9, center=center, height=600,
)
fig.update_layout(map_style="carto-positron", margin=dict(l=0,r=0,t=40,b=0), title="Listings by Borough")  # scatter_map (MapLibre) nutzt map_style, nicht mapbox_style
fig.show()
In [237]:
# Points map, colored by price (continuous)
fig = px.scatter_map(
    sample,
    lat="latitude", lon="longitude",
    color="price",
    hover_data=["name","neighbourhood_cleansed","room_type","price"],
    zoom=9, center=center, height=600,
)
fig.update_layout(map_style="carto-positron", margin=dict(l=0,r=0,t=40,b=0), title="Listings colored by Price")
fig.show()
In [238]:
# Density map (where listings cluster)
fig = px.density_map(
    sample,
    lat="latitude", lon="longitude",
    z=None, # or "price" to weight by price
    radius=8, # increase for smoother heat
    center=center, zoom=9, height=600,
)
fig.update_layout(map_style="carto-positron", margin=dict(l=0,r=0,t=40,b=0), title="Listing Density (Heat)")
fig.show()
In [239]:
# Points map, colored by property_type
fig = px.scatter_map(
    sample,
    lat="latitude", lon="longitude",
    color="property_type",
    hover_data=["name","neighbourhood_cleansed","room_type","property_type"],
    zoom=9, center=center, height=600,
)
fig.update_layout(map_style="carto-positron", margin=dict(l=0,r=0,t=40,b=0), title="Listings colored by Property type")
fig.show()
In [240]:
# Points map, colored by availability_365
fig = px.scatter_map(
    sample,
    lat="latitude", lon="longitude",
    color="availability_365",
    hover_data=["name","neighbourhood_cleansed","room_type","availability_365"],
    zoom=9, center=center, height=600,
)
fig.update_layout(map_style="carto-positron", margin=dict(l=0,r=0,t=40,b=0), title="Listings colored by availability_365")
fig.show()

Dash mit Diagramm und Schieberegler: Airbnb NYC – Jahresfilter: Verfügbarkeit vs Preis¶

Ziel: Interaktive Punktekarte (availability_365 vs. price) mit Jahres-Slider.

Farbe = Borough (neighbourhood_group_cleansed), Punktgröße = number_of_reviews.

Schritte

  • Slider-Jahre: Aus host_since_year ermittelt.
  • Callback: Filter als eine kombinierte Bedingung.
  • Plot: px.scatter mit template="plotly_white", Farbe = Borough, Größe = Reviews, Hover zeigt room_type, bedrooms, bathrooms.

HTML-Export & Öffnen

  • Speichern als HTML (self-contained): fig.write_html(out_file_NY, include_plotlyjs="cdn", full_html=True)
  • Link im Standard-Browser öffnen: webbrowser.open("file://" + os.path.abspath(out_file_NY))
  • Link im Notebook einbetten: IFrame(src=out_file_NY, width="100%", height=600)

Interpretation

  • Steigung/Form: Beziehung zwischen Verfügbarkeit (365 Tage) und Preis.
  • Farbcluster: Räumliche Segmente (Boroughs).
  • Punktgröße: Signalisiert Popularität/Aktivität über Reviews.
In [242]:
out_file_NY = "airbnb_price_availability_yearslider_NY.html"

# slider years
years = sorted([int(y) for y in df_NY["host_since_year"].unique()])
initial_year = years[0]

app = Dash(__name__)

app.layout = html.Div([
    dcc.Graph(id="graph-with-slider"),
    dcc.Slider(
        min=years[0], max=years[-1], step=None, value=initial_year,
        marks={str(y): str(y) for y in years},
        id="year-slider"
    )
])

@app.callback(
    Output("graph-with-slider", "figure"),
    Input("year-slider", "value")
)
def update_figure(selected_year):
    # Preis-Maske (price < 1500) mit dem Jahresfilter kombinieren
    cond = (df_NY["host_since_year"] == selected_year) & mask
    df_NYf = df_NY.loc[cond]

    if df_NYf.empty:
        # Leere Figur zurückgeben (die Variable fig existiert an dieser Stelle sonst noch nicht)
        return px.scatter(template="plotly_white", title=f"Keine Listings – {selected_year}")

    # Scatter: x=availability_365, y=price, color=borough, size=number_of_reviews
    fig = px.scatter(
        df_NYf,
        x="availability_365", y="price",
        color="neighbourhood_group_cleansed",
        size="number_of_reviews",
        hover_data=["room_type", "bedrooms", "bathrooms"],
        size_max=18,
        title=f"Listings – {selected_year}",
        template="plotly_white"
    )
    
    # Save as HTML (wird bei jeder Slider-Änderung neu geschrieben)
    fig.write_html(out_file_NY, include_plotlyjs="cdn", full_html=True)
    
    fig.update_layout(transition_duration=500, legend_title_text="Borough")
    return fig

if __name__ == "__main__":
    app.run(debug=True, port=8050)
In [243]:
# Embed the link inside the notebook
# IFrame(src=out_file_NY, width="100%", height=600)
In [244]:
# Open directly in the default browser
# webbrowser.open("file://" + os.path.abspath(out_file_NY))

Dashboard: Tabelle und Histogramm mit Radio-Auswahl. Komponenten nebeneinander zeigen.¶

Ziel: Ein Dash-Layout, das eine Daten-Tabelle und ein Histogramm anzeigt. Per Radio-Buttons wird die Kennzahl gewählt, die im Plot aggregiert dargestellt wird.

Aufbau

  • Styling: Einbindung eines externen CSS (CodePen) für das Grid-Layout.
  • Kontrollelemente: dcc.RadioItems zur Auswahl der Metrik
    (Optionen: price, review_scores_rating, availability_365).
  • Inhalt in zwei Spalten:
    • Links: Dash Tabelle mit den Datensätzen.
    • Rechts: Dash Graph für das Histogramm.

Interaktion (Callback)

  • Input: Wert der Radio-Buttons.
  • Output: Aktualisiert die Figur.
  • Logik: px.histogram mit
    x='neighbourhood_group_cleansed',
    y=col_chosen,
    histfunc='avg' -> zeigt durchschnittliche Werte je Borough.

Start

  • Serverstart über app.run(debug=True, port=8051) und Aufruf im Browser (lokale URL).
In [246]:
# Initialize the app - incorporate css
external_stylesheets = ['https://codepen.io/chriddyp/pen/bWLwgP.css']
app = Dash(external_stylesheets=external_stylesheets)

# App layout
app.layout = [
    html.Div(className='row', children='Tabelle, Kontrollelemente und Histogramm mit durchschnittlichen Werten je Borough',
             style={'textAlign': 'center', 'color': 'blue', 'fontSize': 30}),

    html.Div(className='row', children=[
        dcc.RadioItems(options=['price', 'review_scores_rating', 'availability_365'],
                       value='review_scores_rating',
                       inline=True,
                       id='my-radio-buttons-final')
    ]),

    html.Div(className='row', children=[
        html.Div(className='six columns', children=[
            dash_table.DataTable(data=df_NY.to_dict('records'), page_size=11, style_table={'overflowX': 'auto'})
        ]),
        html.Div(className='six columns', children=[
            dcc.Graph(figure={}, id='histo-chart-final')
        ])
    ])
]

# Add controls to build the interaction
@callback(
    Output(component_id='histo-chart-final', component_property='figure'),
    Input(component_id='my-radio-buttons-final', component_property='value')
)
def update_graph(col_chosen):
    fig = px.histogram(df_NY, x='neighbourhood_group_cleansed', y=col_chosen, histfunc='avg')
    return fig

# Run the app
if __name__ == '__main__':
    app.run(debug=True, port=8051)

Geo-Visualisierung der Listings Rome¶

Ziel: Rome Airbnb-Listings sichtbar machen – nach Borough, Preis, Dichte und Objekttyp.

Datenbasis & Kartenstil

  • Maskenbildung: Einmalige mask auf price < 1500.
  • Zur Performance wurde ein Sample des Datensatzes (sample) verwendet.
  • Kartenzentrum: Median von Latitude/Longitude; Stil: carto-positron.
  • Hover-Infos: name, neighbourhood_cleansed, room_type, price (bzw. property_type, availability_365).

Visualisierungen

  • Punktekarte – eingefärbt nach Borough (neighbourhood_group_cleansed)
    -> Zeigt die räumliche Verteilung der Listings über die Boroughs und lokale Häufungen.
  • Punktekarte – eingefärbt nach Preis (kontinuierlich)
    -> Hebt Preis-Hotspots und günstigere Zonen hervor; erleichtert visuelle Outlier-Erkennung.
  • Dichtekarte (Heatmap)
    -> Visualisiert Listing-Cluster unabhängig von der Farbe. Der Parameter radius steuert die Glättung (größer = glatter).
  • Punktekarte – eingefärbt nach Objekttyp (property_type)
    -> Zeigt den räumlichen Mix der Unterkunftsarten und mögliche Typ-Cluster.

Interpretationshinweise

  • Bei Punktkarten können sich Marker überlagern; die Dichtekarte ergänzt hier den Überblick.
  • Farbskalen unterstützen das Erkennen von Hotspots und Strukturunterschieden zwischen Stadtteilen.
  • Für belastbare Schlüsse mit Aggregaten (z. B. Median-Preis je Borough) gegenprüfen.

Vorbereitung der Daten¶

Einlesen der Daten

  • Reine Index-Spalte "Unnamed: 0" droppen
In [248]:
file_Rome = "data/Rome/listings_clean.csv" 
df_Rome = pd.read_csv(file_Rome, low_memory=False)

df_Rome = df_Rome.drop(columns="Unnamed: 0", errors="ignore")

df_Rome.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36309 entries, 0 to 36308
Data columns (total 49 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   id                            36309 non-null  int64  
 1   name                          36309 non-null  object 
 2   host_id                       36309 non-null  int64  
 3   host_since                    36309 non-null  object 
 4   host_verifications            36309 non-null  object 
 5   host_identity_verified        36309 non-null  bool   
 6   neighbourhood_cleansed        36309 non-null  object 
 7   neighbourhood_group_cleansed  17656 non-null  object 
 8   latitude                      36309 non-null  float64
 9   longitude                     36309 non-null  float64
 10  property_type                 36309 non-null  object 
 11  room_type                     36309 non-null  object 
 12  accommodates                  36309 non-null  int64  
 13  bathrooms                     36309 non-null  float64
 14  bedrooms                      36309 non-null  float64
 15  amenities                     36309 non-null  object 
 16  price                         36309 non-null  float64
 17  minimum_nights                36309 non-null  int64  
 18  maximum_nights                36309 non-null  int64  
 19  minimum_minimum_nights        36299 non-null  float64
 20  maximum_minimum_nights        36299 non-null  float64
 21  minimum_maximum_nights        36299 non-null  float64
 22  maximum_maximum_nights        36299 non-null  float64
 23  minimum_nights_avg_ntm        36309 non-null  float64
 24  maximum_nights_avg_ntm        36309 non-null  float64
 25  has_availability              36309 non-null  bool   
 26  availability_30               36309 non-null  int64  
 27  availability_60               36309 non-null  int64  
 28  availability_90               36309 non-null  int64  
 29  availability_365              36309 non-null  int64  
 30  number_of_reviews             36309 non-null  int64  
 31  number_of_reviews_ltm         36309 non-null  int64  
 32  number_of_reviews_l30d        36309 non-null  int64  
 33  availability_eoy              36309 non-null  int64  
 34  number_of_reviews_ly          36309 non-null  int64  
 35  estimated_occupancy_l365d     36309 non-null  int64  
 36  first_review                  36309 non-null  object 
 37  last_review                   36309 non-null  object 
 38  review_scores_rating          31523 non-null  float64
 39  review_scores_accuracy        31516 non-null  float64
 40  review_scores_cleanliness     31517 non-null  float64
 41  review_scores_checkin         31516 non-null  float64
 42  review_scores_communication   31517 non-null  float64
 43  review_scores_location        31515 non-null  float64
 44  review_scores_value           31516 non-null  float64
 45  license                       36309 non-null  object 
 46  instant_bookable              36309 non-null  bool   
 47  reviews_per_month             31523 non-null  float64
 48  host_since_year               36309 non-null  int64  
dtypes: bool(3), float64(19), int64(16), object(11)
memory usage: 12.8+ MB
In [249]:
# Map center from data
center = dict(lat=df_Rome["latitude"].median(), lon=df_Rome["longitude"].median())

mask = df_Rome["price"] < 1500
sample = df_Rome[mask]

# Downsample for performance (adjust N as you like)
# sample = df_Rome.sample(min(80000, len(df_Rome)), random_state=42)
In [250]:
# Points map, colored by borough
fig = px.scatter_map(
    sample,
    lat="latitude", lon="longitude",
    color="neighbourhood_group_cleansed",
    hover_data=["name","neighbourhood_cleansed","room_type","price"],
    zoom=9, center=center, height=600,
)
fig.update_layout(map_style="carto-positron", margin=dict(l=0,r=0,t=40,b=0), title="Listings by Borough")  # scatter_map (MapLibre) nutzt map_style, nicht mapbox_style
fig.show()
In [251]:
# Points map, colored by price (continuous)
fig = px.scatter_map(
    sample,
    lat="latitude", lon="longitude",
    color="price",
    hover_data=["name","neighbourhood_cleansed","room_type","price"],
    zoom=9, center=center, height=600,
)
fig.update_layout(map_style="carto-positron", margin=dict(l=0,r=0,t=40,b=0), title="Listings colored by Price")
fig.show()
In [252]:
# Density map (where listings cluster)
fig = px.density_map(
    sample,
    lat="latitude", lon="longitude",
    z=None, # or "price" to weight by price
    radius=8, # increase for smoother heat
    center=center, zoom=9, height=600,
)
fig.update_layout(map_style="carto-positron", margin=dict(l=0,r=0,t=40,b=0), title="Listing Density (Heat)")
fig.show()
In [253]:
# Points map, colored by property_type
fig = px.scatter_map(
    sample,
    lat="latitude", lon="longitude",
    color="property_type",
    hover_data=["name","neighbourhood_cleansed","room_type","property_type"],
    zoom=9, center=center, height=600,
)
fig.update_layout(map_style="carto-positron", margin=dict(l=0,r=0,t=40,b=0), title="Listings colored by Property type")
fig.show()
In [254]:
# Points map, colored by availability_365
fig = px.scatter_map(
    sample,
    lat="latitude", lon="longitude",
    color="availability_365",
    hover_data=["name","neighbourhood_cleansed","room_type","availability_365"],
    zoom=9, center=center, height=600,
)
fig.update_layout(map_style="carto-positron", margin=dict(l=0,r=0,t=40,b=0), title="Listings colored by availability_365")
fig.show()

Dash mit Diagramm und Schieberegler: Airbnb Rome – Jahresfilter: Verfügbarkeit & Preis¶

Ziel: Interaktive Punktekarte (availability_365 vs. price) mit Jahres-Slider.

Farbe = Borough (neighbourhood_group_cleansed), Punktgröße = number_of_reviews.

Schritte

  • Slider-Jahre: Aus host_since_year ermittelt.
  • Callback: Filter als eine kombinierte Bedingung.
  • Plot: px.scatter mit template="plotly_white", Farbe = Borough, Größe = Reviews, Hover zeigt room_type, bedrooms, bathrooms.

HTML-Export & Öffnen

  • Speichern als HTML (self-contained): fig.write_html(out_file_Rome, include_plotlyjs="cdn", full_html=True)
  • Link im Standard-Browser öffnen: webbrowser.open("file://" + os.path.abspath(out_file_Rome))
  • Link im Notebook einbetten: IFrame(src=out_file_Rome, width="100%", height=600)

Interpretation

  • Steigung/Form: Beziehung zwischen Verfügbarkeit (365 Tage) und Preis.
  • Farbcluster: Räumliche Segmente (Boroughs).
  • Punktgröße: Signalisiert Popularität/Aktivität über Reviews.
In [255]:
out_file_Rome = "airbnb_price_availability_yearslider_Rome.html"

# slider years
years = sorted([int(y) for y in df_Rome["host_since_year"].unique()])
initial_year = years[0]

app = Dash(__name__)

app.layout = html.Div([
    dcc.Graph(id="graph-with-slider"),
    dcc.Slider(
        min=years[0], max=years[-1], step=None, value=initial_year,
        marks={str(y): str(y) for y in years},
        id="year-slider"
    )
])

@app.callback(
    Output("graph-with-slider", "figure"),
    Input("year-slider", "value")
)
def update_figure(selected_year):
    # Preis-Maske (price < 1500) mit dem Jahresfilter kombinieren
    cond = (df_Rome["host_since_year"] == selected_year) & mask
    df_Romef = df_Rome.loc[cond]

    if df_Romef.empty:
        # Leere Figur zurückgeben (die Variable fig existiert an dieser Stelle sonst noch nicht)
        return px.scatter(template="plotly_white", title=f"Keine Listings – {selected_year}")

    # Scatter: x=availability_365, y=price, color=borough, size=number_of_reviews
    fig = px.scatter(
        df_Romef,
        x="availability_365", y="price",
        color="neighbourhood_group_cleansed",
        size="number_of_reviews",
        hover_data=["room_type", "bedrooms", "bathrooms"],
        size_max=18,
        title=f"Listings – {selected_year}",
        template="plotly_white"
    )
    
    # Save as HTML (wird bei jeder Slider-Änderung neu geschrieben)
    fig.write_html(out_file_Rome, include_plotlyjs="cdn", full_html=True)
    
    fig.update_layout(transition_duration=500, legend_title_text="Borough")
    return fig

if __name__ == "__main__":
    app.run(debug=True, port=8053)
In [256]:
# Embed the link inside the notebook
# IFrame(src=out_file_Rome, width="100%", height=600)
In [257]:
# Open directly in the default browser
# webbrowser.open("file://" + os.path.abspath(out_file_Rome))

Dashboard: Tabelle und Histogramm mit Radio-Auswahl. Komponenten nebeneinander zeigen.¶

Ziel: Ein Dash-Layout, das eine Daten-Tabelle und ein Histogramm anzeigt. Per Radio-Buttons wird die Kennzahl gewählt, die im Plot aggregiert dargestellt wird.

Aufbau

  • Styling: Einbindung eines externen CSS (CodePen) für das Grid-Layout.
  • Kontrollelemente: dcc.RadioItems zur Auswahl der Metrik
    (Optionen: price, review_scores_rating, availability_365).
  • Inhalt in zwei Spalten:
    • Links: Dash Tabelle mit den Datensätzen.
    • Rechts: Dash Graph für das Histogramm.

Interaktion (Callback)

  • Input: Wert der Radio-Buttons.
  • Output: Aktualisiert die Figur.
  • Logik: px.histogram mit
    x='neighbourhood_group_cleansed',
    y=col_chosen,
    histfunc='avg' -> zeigt durchschnittliche Werte je Borough.

Start

  • Serverstart über app.run(debug=True, port=8054) und Aufruf im Browser (lokale URL).
In [258]:
# Initialize the app - incorporate css
external_stylesheets = ['https://codepen.io/chriddyp/pen/bWLwgP.css']
app = Dash(external_stylesheets=external_stylesheets)


# App layout
app.layout = [
    html.Div(className='row', children='Tabelle, Kontrollelemente und Histogramm mit durchschnittlichen Werten je Borough',
             style={'textAlign': 'center', 'color': 'blue', 'fontSize': 30}),

    html.Div(className='row', children=[
        dcc.RadioItems(options=['price', 'review_scores_rating', 'availability_365'],
                       value='review_scores_rating',
                       inline=True,
                       id='my-radio-buttons-final')
    ]),

    html.Div(className='row', children=[
        html.Div(className='six columns', children=[
            dash_table.DataTable(data=df_Rome.to_dict('records'), page_size=11, style_table={'overflowX': 'auto'})
        ]),
        html.Div(className='six columns', children=[
            dcc.Graph(figure={}, id='histo-chart-final')
        ])
    ])
]

# Add controls to build the interaction
@callback(
    Output(component_id='histo-chart-final', component_property='figure'),
    Input(component_id='my-radio-buttons-final', component_property='value')
)
def update_graph(col_chosen):
    fig = px.histogram(df_Rome, x='neighbourhood_group_cleansed', y=col_chosen, histfunc='avg')
    return fig

# Run the app
if __name__ == '__main__':
    app.run(debug=True, port=8054)

Donut-Kreischarts: Lizenzstatus New York vs. Rome¶

Ziel: Prozentuale Verteilung der Lizenzangaben je Stadt als Donut.

Gruppenbildung

  • New York (3 Sektoren):
    Registriert (OSE) = Werte beginnend mit OSE- · Exempt = genau Exempt · Unbenannt = leer/NaN/sonstiges.
  • Rom (4 Sektoren):
    CIN vorhanden = Werte beginnend mit IT · CIN beantragt = Text enthält „applied for a CIN“ · Exempt = genau Exempt · Unbenannt = leer/NaN/sonstiges.

Darstellung: Zwei Donuts nebeneinander (Plotly), Text = Prozent + Label, stadtspezifische Farben. Kategorien ohne Vorkommen werden automatisch weggelassen.

In [259]:
# Spalte "license" - Werte [nan, 'Exempt', "OSE-*"]
print(df_NY["license"].unique()[:5])
[i for i in df_NY["license"].unique() if str(i)[:4].upper() != "OSE-"]
['Unbenannt' 'OSE-STRREG-0000008' 'OSE-STRREG-0000923'
 'OSE-STRREG-0000656' 'OSE-STRREG-0000108']
Out[259]:
['Unbenannt', 'Exempt']
In [260]:
# Spalte "license" - Werte [nan, 'Exempt', "IT*"]
print(df_Rome["license"].unique()[:5])
[i for i in df_Rome["license"].unique() if str(i)[:2].upper() != "IT"]
['Unbenannt' 'IT058091C29VJSIZQZ' 'it058091c2kv6epw8f'
 'IT058091C20YD35BX2' 'IT058091B4I3RFDVLG']
Out[260]:
['Unbenannt', 'I applied for a CIN, but haven’t received it yet', 'Exempt']
In [261]:
# Donut-Kreischarts für Lizenz-Gruppen: New York (3 Sektoren) & Rome (4 Sektoren)

def group_license_ny(val):
    if pd.isna(val):
        return "Unbenannt"
    s = str(val).strip()
    up = s.upper()
    if up.startswith("OSE-"):          # z. B. OSE-STRREG-...
        return "Registriert (OSE)"
    if up == "EXEMPT":
        return "Exempt"
    if s == "" or up in {"UNBENANNT", "NAN"}:
        return "Unbenannt"
    # Alles, was nicht OSE-/Exempt ist, zu "Unbenannt" schieben (nur 3 Gruppen gewünscht)
    return "Unbenannt"

def group_license_rome(val):
    if pd.isna(val):
        return "Unbenannt"
    s = str(val).strip()
    up = s.upper()
    if up.startswith("IT"):  # z. B. IT058091...
        return "CIN vorhanden"
    if "APPLIED FOR A CIN" in up:  # "I applied for a CIN, but haven’t received it yet"
        return "CIN beantragt"
    if up == "EXEMPT":
        return "Exempt"
    if s == "" or up in {"UNBENANNT", "NAN"}:
        return "Unbenannt"
    # Rest ebenfalls zu "Unbenannt" (nur 4 Gruppen gewünscht)
    return "Unbenannt"

# Häufigkeiten berechnen 
ny_groups = df_NY["license"].apply(group_license_ny).value_counts()
rome_groups = df_Rome["license"].apply(group_license_rome).value_counts()

# Feste Reihenfolge (optional)
ny_order = ["Registriert (OSE)", "Exempt", "Unbenannt"]
rome_order = ["CIN vorhanden", "CIN beantragt", "Exempt", "Unbenannt"]

ny_labels = [g for g in ny_order if g in ny_groups.index]
ny_values = [int(ny_groups[g]) for g in ny_labels]

rome_labels = [g for g in rome_order if g in rome_groups.index]
rome_values = [int(rome_groups[g]) for g in rome_labels]

# Farben (3 für NY, 4 für Rom) 
ny_colors = ["#1f77b4", "#ff7f0e", "#2ca02c"]              # Blau/Orange/Grün
rome_colors = ["#636EFA", "#EF553B", "#00CC96", "#AB63FA"] # Blau/Rot/Grün/Lila

# Donut-Subplots zeichnen 
fig = make_subplots(
    rows=1, cols=2,
    specs=[[{"type": "domain"}, {"type": "domain"}]],
    subplot_titles=("New York", "Rome")
)

fig.add_trace(
    go.Pie(
        labels=ny_labels, values=ny_values,
        hole=0.5, textinfo="percent+label",
        marker=dict(colors=ny_colors[:len(ny_labels)])
    ),
    1, 1
)

fig.add_trace(
    go.Pie(
        labels=rome_labels, values=rome_values,
        hole=0.5, textinfo="percent+label",
        marker=dict(colors=rome_colors[:len(rome_labels)])
    ),
    1, 2
)

fig.update_layout(
    title_text="Lizenzstatus: Verteilung von Lizenzgruppen",
    margin=dict(l=20, r=20, t=60, b=20),
    showlegend=False,
    uniformtext_minsize=10,
    uniformtext_mode="hide"
)

fig.show()

Natural Language Processing mit der spaCy-Bibliothek¶

Ausstattung Airbnb New York vs Airbnb Rome

Das Standardmodell ist "en_core_web_sm" (Englisch). Das Modell wird mit dem folgenden Befehl in der Konsole heruntergeladen:

In [262]:
# !python -m spacy download en_core_web_sm
In [263]:
nlp = spacy.load("en_core_web_sm")
In [264]:
df_NY["amenities"].values[:2]
Out[264]:
array(['["Dedicated workspace", "Essentials", "Cooking basics", "Stove", "Blender", "Dishes and silverware", "Hot water", "Dining table", "Dryer", "Washer", "Lock on bedroom door", "Kitchen", "Bathtub", "Oven", "Refrigerator", "Wine glasses", "Single level home", "Fire extinguisher", "Fast wifi \\u2013 330 Mbps", "Luggage dropoff allowed", "Microwave", "Bed linens", "Cleaning available during stay", "Conditioner", "Smoke alarm", "Elevator", "Body soap", "Clothing storage", "Self check-in", "Hair dryer", "Backyard", "Carbon monoxide alarm", "Freezer", "Heating", "Shower gel", "Coffee maker", "Paid parking off premises", "Extra pillows and blankets", "Cleaning products", "Long term stays allowed", "Free street parking", "Shampoo", "Laundromat nearby", "Lockbox"]',
       '["Dedicated workspace", "Essentials", "Cooking basics", "Stove", "Dishes and silverware", "TV", "Hot water", "Kitchen", "Keypad", "Bathtub", "Hangers", "Refrigerator", "Oven", "Fire extinguisher", "Luggage dropoff allowed", "Bed linens", "Cleaning available during stay", "Baking sheet", "Smoke alarm", "Self check-in", "Hair dryer", "Wifi", "Iron", "Carbon monoxide alarm", "Ethernet connection", "Heating", "Air conditioning", "Coffee maker", "Paid parking off premises", "Extra pillows and blankets", "Long term stays allowed", "Free street parking"]'],
      dtype=object)
In [265]:
df_Rome["amenities"].values[:2]
Out[265]:
array(['["Cleaning products", "Dryer", "Bed linens", "Outdoor dining area", "Hot tub", "Dining table", "Luggage dropoff allowed", "Oven", "Wifi", "Free street parking", "Microwave", "Free parking on premises", "Laundromat nearby", "Drying rack for clothing", "Kitchen", "Heating", "Freezer", "Coffee maker", "Patio or balcony", "Washer", "Hot water kettle", "Dishwasher", "Hangers", "Toaster", "Refrigerator", "Long term stays allowed", "Hot water", "Wine glasses", "Portable fans", "Dedicated workspace", "Cooking basics", "Hair dryer", "High chair", "Dishes and silverware", "First aid kit", "Elevator", "Bidet", "Blender", "Housekeeping - available at extra cost", "Air conditioning", "Stove", "Iron"]',
       '["Dryer", "Bed linens", "Luggage dropoff allowed", "Extra pillows and blankets", "Wifi", "Shampoo", "Kitchen", "Heating", "Coffee maker", "Hangers", "Ethernet connection", "Refrigerator", "Long term stays allowed", "Hot water", "Dedicated workspace", "Hair dryer", "Paid parking off premises", "Essentials", "Dishes and silverware", "Pocket wifi", "Elevator", "Window AC unit", "HDTV with standard cable", "Host greets you", "Free washer \\u2013 In unit", "Iron"]'],
      dtype=object)
In [266]:
# Eckige Klammern entfernen; im Notebook wird nur der Wert der letzten Zeile (Rome) angezeigt
df_NY["amenities"].values[5].replace("[", "").replace("]", "")
df_Rome["amenities"].values[5].replace("[", "").replace("]", "")
Out[266]:
'"Cleaning products", "Bed linens", "Dining table", "Extra pillows and blankets", "Shower gel", "Room-darkening shades", "Central air conditioning", "City skyline view", "Free street parking", "Smoking allowed", "Cleaning available during stay", "Microwave", "Crib - available upon request", "Books and reading material", "Shampoo", "Drying rack for clothing", "Laundromat nearby", "Kitchen", "Freezer", "Coffee maker", "Hot water kettle", "Paid parking garage on premises", "Clothing storage: closet and wardrobe", "Hangers", "Toaster", "Central heating", "Single level home", "Refrigerator", "Long term stays allowed", "Gas stove", "Hot water", "Wine glasses", "Dedicated workspace", "Cooking basics", "Hair dryer", "Paid parking off premises", "Essentials", "Coffee", "Pets allowed", "Fast wifi \\u2013 54 Mbps", "Dishes and silverware", "Fire extinguisher", "24 inch HDTV", "Elevator", "Ceiling fan", "Host greets you", "Body soap", "Blender", "Free washer \\u2013 In unit", "Iron"'
In [267]:
# Viertel der Zeilenzahl beider DataFrames
len(df_NY)//4, len(df_Rome)//4
Out[267]:
(9080, 9077)
In [268]:
text_NY = "".join([i.replace("[", "").replace("]", "") for i in df_NY["amenities"].values[:10000]])
len(text_NY)  # 3328489
# Ohne Kürzung wirft nlp(text_NY): "Text of length 3328489 exceeds maximum of 1000000" (nlp.max_length)
Out[268]:
3328489
In [269]:
text_Rome = "".join([i.replace("[", "").replace("]", "") for i in df_Rome["amenities"].values[:10000]])
len(text_Rome)  # 5622364
# Ohne Kürzung wirft nlp(text_Rome): "Text of length 5622364 exceeds maximum of 1000000" (nlp.max_length)
Out[269]:
5622364
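Statt den Text in der nächsten Zelle hart bei 800.000 Zeichen abzuschneiden, ließe sich der Korpus auch in Blöcken unterhalb von nlp.max_length verarbeiten. Eine Skizze mit einer hypothetischen Hilfsfunktion batch_texts:

```python
def batch_texts(texts, max_len=1_000_000):
    """Fasst Strings zu Blöcken zusammen, deren Gesamtlänge unter max_len bleibt.

    Hinweis: ein einzelner String, der selbst länger als max_len ist,
    wird unverändert als eigener Block durchgereicht.
    """
    batch, size = [], 0
    for t in texts:
        if batch and size + len(t) > max_len:
            yield "".join(batch)
            batch, size = [], 0
        batch.append(t)
        size += len(t)
    if batch:
        yield "".join(batch)

# Anwendung (Skizze): jeden Block einzeln durch spaCy schicken und Tokens sammeln.
# for chunk in batch_texts(df_NY["amenities"].values[:10000]):
#     doc = nlp(chunk)
```

So ginge kein Teil der Amenities-Texte verloren; die Häufigkeiten unten beziehen sich dagegen nur auf die ersten 800.000 Zeichen.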
In [270]:
doc_NY = nlp(text_NY[:800000])
doc_Rome = nlp(text_Rome[:800000])
In [271]:
print([token.text for token in doc_NY][:100])
['"', 'Dedicated', 'workspace', '"', ',', '"', 'Essentials', '"', ',', '"', 'Cooking', 'basics', '"', ',', '"', 'Stove', '"', ',', '"', 'Blender', '"', ',', '"', 'Dishes', 'and', 'silverware', '"', ',', '"', 'Hot', 'water', '"', ',', '"', 'Dining', 'table', '"', ',', '"', 'Dryer', '"', ',', '"', 'Washer', '"', ',', '"', 'Lock', 'on', 'bedroom', 'door', '"', ',', '"', 'Kitchen', '"', ',', '"', 'Bathtub', '"', ',', '"', 'Oven', '"', ',', '"', 'Refrigerator', '"', ',', '"', 'Wine', 'glasses', '"', ',', '"', 'Single', 'level', 'home', '"', ',', '"', 'Fire', 'extinguisher', '"', ',', '"', 'Fast', 'wifi', '\\u2013', '330', 'Mbps', '"', ',', '"', 'Luggage', 'dropoff', 'allowed', '"', ',', '"']
In [272]:
print([token.text for token in doc_Rome][:100])
['"', 'Cleaning', 'products', '"', ',', '"', 'Dryer', '"', ',', '"', 'Bed', 'linens', '"', ',', '"', 'Outdoor', 'dining', 'area', '"', ',', '"', 'Hot', 'tub', '"', ',', '"', 'Dining', 'table', '"', ',', '"', 'Luggage', 'dropoff', 'allowed', '"', ',', '"', 'Oven', '"', ',', '"', 'Wifi', '"', ',', '"', 'Free', 'street', 'parking', '"', ',', '"', 'Microwave', '"', ',', '"', 'Free', 'parking', 'on', 'premises', '"', ',', '"', 'Laundromat', 'nearby', '"', ',', '"', 'Drying', 'rack', 'for', 'clothing', '"', ',', '"', 'Kitchen', '"', ',', '"', 'Heating', '"', ',', '"', 'Freezer', '"', ',', '"', 'Coffee', 'maker', '"', ',', '"', 'Patio', 'or', 'balcony', '"', ',', '"', 'Washer', '"', ',']
In [273]:
from collections import Counter

words_NY = [token.lemma_.lower() for token in doc_NY if not token.is_stop and not token.is_punct and not token.is_space]
len(words_NY)
Out[273]:
87591
In [274]:
words_Rome = [token.lemma_.lower() for token in doc_Rome if not token.is_stop and not token.is_punct and not token.is_space]
len(words_Rome)
Out[274]:
88840
In [275]:
# Compute the frequency of each word
word_freq_NY = Counter(words_NY)
# Select the 10 most frequent words
most_freq_words_NY = word_freq_NY.most_common(10)

for word in most_freq_words_NY:
    print(word)
('alarm', 2955)
('wifi', 1961)
('dryer', 1843)
('kitchen', 1719)
('hot', 1638)
('smoke', 1597)
('water', 1562)
('air', 1536)
('heating', 1470)
('allow', 1456)
In [276]:
# Compute the frequency of each word
word_freq_Rome = Counter(words_Rome)
# Select the 10 most frequent words
most_freq_words_Rome = word_freq_Rome.most_common(10)

for word in most_freq_words_Rome:
    print(word)
('hot', 1839)
('water', 1802)
('allow', 1624)
('parking', 1378)
('dryer', 1365)
('wifi', 1362)
('coffee', 1335)
('heating', 1322)
('alarm', 1318)
('essential', 1233)
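With both frequency tables in hand, `Counter` arithmetic makes the city contrast explicit: subtracting one counter from the other keeps only the positive differences, i.e. the words over-represented in that city. A minimal sketch with toy counts (the real `word_freq_NY` / `word_freq_Rome` from the cells above would slot in directly):

```python
from collections import Counter

# Toy counts in the shape of the word_freq_* Counters above
freq_ny = Counter({"alarm": 2955, "wifi": 1961, "parking": 1380})
freq_rome = Counter({"alarm": 1318, "wifi": 1362, "parking": 1378, "coffee": 1335})

# Counter subtraction drops zero and negative counts,
# leaving only each city's surplus mentions
more_in_ny = freq_ny - freq_rome    # words mentioned more often in NY
more_in_rome = freq_rome - freq_ny  # words mentioned more often in Rome

print(more_in_ny.most_common(3))   # [('alarm', 1637), ('wifi', 599), ('parking', 2)]
print(more_in_rome.most_common(3))  # [('coffee', 1335)]
```

On the real data this highlights, for example, how strongly "alarm" (smoke and carbon monoxide alarms) dominates the New York amenities relative to Rome.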
In [277]:
import pandas as pd
tokens_NY = pd.DataFrame({"Token": [token.text for token in doc_NY],
              "Lemma": [token.lemma_ for token in doc_NY],
              "POS": [token.pos_ for token in doc_NY],
              "Tag": [token.tag_ for token in doc_NY],
              "Dep": [token.dep_ for token in doc_NY]})
tokens_NY[tokens_NY.POS == "NOUN"]
Out[277]:
Token Lemma POS Tag Dep
2 workspace workspace NOUN NN nmod
6 Essentials essential NOUN NNS nmod
11 basics basic NOUN NNS appos
23 Dishes dish NOUN NNS conj
25 silverware silverware NOUN NN conj
... ... ... ... ... ...
232724 maker maker NOUN NN appos
232729 parking parking NOUN NN compound
232731 premises premise NOUN NNS appos
232736 products product NOUN NNS appos
232742 unit unit NOUN NN appos

56868 rows × 5 columns

In [278]:
import pandas as pd
tokens_Rome = pd.DataFrame({"Token": [token.text for token in doc_Rome],
              "Lemma": [token.lemma_ for token in doc_Rome],
              "POS": [token.pos_ for token in doc_Rome],
              "Tag": [token.tag_ for token in doc_Rome],
              "Dep": [token.dep_ for token in doc_Rome]})
tokens_Rome[tokens_Rome.POS == "NOUN"]
Out[278]:
Token Lemma POS Tag Dep
2 products product NOUN NNS dep
10 Bed bed NOUN NN compound
11 linens linen NOUN NNS conj
16 dining dining NOUN NN compound
17 area area NOUN NN appos
... ... ... ... ... ...
230991 water water NOUN NN compound
230992 kettle kettle NOUN NN appos
230996 Hair hair NOUN NN compound
230997 dryer dryer NOUN NN appos
231002 aid aid NOUN NN appos

57915 rows × 5 columns

In [279]:
# Compute the frequency of each noun
# Note: surface tokens are counted here (not lemmas), so "Wifi" and "wifi" count separately
nouns_NY = tokens_NY[tokens_NY.POS == "NOUN"].Token
word_freq_NY = Counter(nouns_NY)
# Select the 10 most frequent nouns
most_freq_nouns_NY = word_freq_NY.most_common(10)

for word in most_freq_nouns_NY:
    print(word)
('alarm', 2955)
('Wifi', 1836)
('Smoke', 1597)
('water', 1562)
('parking', 1383)
('Heating', 1378)
('dryer', 1367)
('monoxide', 1359)
('TV', 1289)
('Air', 1256)
In [280]:
# Compute the frequency of each noun (surface tokens, as above)
nouns_Rome = tokens_Rome[tokens_Rome.POS == "NOUN"].Token
word_freq_Rome = Counter(nouns_Rome)
# Select the 10 most frequent nouns
most_freq_nouns_Rome = word_freq_Rome.most_common(10)

for word in most_freq_nouns_Rome:
    print(word)
('water', 1801)
('parking', 1378)
('alarm', 1318)
('dryer', 1291)
('Hair', 1228)
('Essentials', 1227)
('Heating', 1157)
('Wifi', 1153)
('Dishes', 1056)
('Refrigerator', 1050)

Amenities Airbnb New York vs. Airbnb Rome - top 10 most frequent nouns¶

| NYC word | NYC count | Rome word    | Rome count |
|----------|-----------|--------------|------------|
| alarm    | 2955      | water        | 1801       |
| Wifi     | 1836      | parking      | 1378       |
| Smoke    | 1597      | alarm        | 1318       |
| water    | 1562      | dryer        | 1291       |
| parking  | 1383      | Hair         | 1228       |
| Heating  | 1378      | Essentials   | 1227       |
| dryer    | 1367      | Heating      | 1157       |
| monoxide | 1359      | Wifi         | 1153       |
| TV       | 1289      | Dishes       | 1056       |
| Air      | 1256      | Refrigerator | 1050       |
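Tokenizing the raw strings splits multi-word amenities apart, which is why "alarm" tops the New York list (fed by both "Smoke alarm" and "Carbon monoxide alarm") and "Hair" and "dryer" are counted separately for Rome. Since the `amenities` column is stored as a JSON-encoded list, an alternative (a sketch, not part of the notebook above) is to parse each row with `json.loads` and count whole phrases:

```python
import json
from collections import Counter

# Toy rows in the same JSON-list format as the amenities column;
# in the notebook this would be df_NY["amenities"].values.
rows = [
    '["Smoke alarm", "Carbon monoxide alarm", "Wifi", "Hair dryer"]',
    '["Smoke alarm", "Wifi", "Kitchen"]',
]

# Count complete amenity phrases instead of individual tokens
phrase_freq = Counter(
    amenity for row in rows for amenity in json.loads(row)
)
print(phrase_freq.most_common(3))
# [('Smoke alarm', 2), ('Wifi', 2), ('Carbon monoxide alarm', 1)]
```

`json.loads` also decodes escapes such as `\u2013` in entries like "Fast wifi \u2013 330 Mbps", so the phrases come out clean without any manual bracket stripping.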
In [ ]: